Apache Ignite Topology Snapshot refresh in .NET

I start an Apache Ignite server node and a client node.
My scenario is: when the client node is closed, how can the server node's Topology Snapshot be updated at the same time?
Currently, the Topology Snapshot is refreshed only after about 20 seconds, when the server receives the NodeFailed event.
What method or configuration on the server side will deliver the NodeFailed event immediately, or refresh the Topology Snapshot sooner?
This is the server log:
[09:08:50,522][WARNING][disco-event-worker-#45%ignite-instance-f69c161b-9f38-4576-b52b-ef3077ba3156%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=5f346db2-50fd-4d83-b518-a09690569274, consistentId=5f346db2-50fd-4d83-b518-a09690569274, addrs=ArrayList [0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.40.1, 192.168.50.135, 192.168.65.1], sockAddrs=HashSet [DESKTOP-1BLUS7R/192.168.40.1:0, /[0:0:0:0:0:0:0:1]:0, /127.0.0.1:0, /192.168.65.1:0, /192.168.50.135:0], discPort=0, order=3, intOrder=3, lastExchangeTime=1602810475243, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]
[09:08:50,525][INFO][disco-event-worker-#45%ignite-instance-f69c161b-9f38-4576-b52b-ef3077ba3156%][GridDiscoveryManager] Topology snapshot [ver=5, locNode=f6d3f760, servers=1, clients=0, state=ACTIVE, CPUs=6, offheap=1.5GB, heap=2.0GB]
[09:08:50,525][INFO][disco-event-worker-#45%ignite-instance-f69c161b-9f38-4576-b52b-ef3077ba3156%][GridDiscoveryManager] ^-- Baseline [id=0, size=1, online=1, offline=0]

You can reduce the server-side ClientFailureDetectionTimeout property, which increases how frequently the server checks client nodes. The default is 30 seconds.
//
// Summary:
//     Gets or sets the failure detection timeout used by Apache.Ignite.Core.Discovery.Tcp.TcpDiscoverySpi
//     and Apache.Ignite.Core.Communication.Tcp.TcpCommunicationSpi for client nodes.
[DefaultValue(typeof(TimeSpan), "00:00:30")]
public TimeSpan ClientFailureDetectionTimeout { get; set; }
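To act on the failure sooner, you can both lower the timeout and subscribe to discovery events locally on the server. A minimal Ignite.NET sketch (the 5-second timeout and the console handler are illustrative choices, not values from the question; events are disabled by default, so they must be enabled via IncludedEventTypes):

using System;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Events;

class FailureListener : IEventListener<DiscoveryEvent>
{
    // Invoked on the server node for every subscribed discovery event.
    public bool Invoke(DiscoveryEvent evt)
    {
        Console.WriteLine("Topology changed: {0} for node {1}", evt.Name, evt.EventNode.Id);
        return true; // keep the subscription alive
    }
}

class Program
{
    static void Main()
    {
        var cfg = new IgniteConfiguration
        {
            // Detect failed clients after 5 seconds instead of the default 30
            // (illustrative value; tune to your network).
            ClientFailureDetectionTimeout = TimeSpan.FromSeconds(5),

            // Events are disabled by default; enable the ones we listen to.
            IncludedEventTypes = new[] { EventType.NodeFailed, EventType.NodeLeft }
        };

        using (var ignite = Ignition.Start(cfg))
        {
            ignite.GetEvents().LocalListen(new FailureListener(),
                EventType.NodeFailed, EventType.NodeLeft);
            Console.ReadLine(); // keep the server node running
        }
    }
}

Note that a client that is stopped gracefully produces NodeLeft immediately; ClientFailureDetectionTimeout only governs how quickly an unresponsive client is declared failed.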

Related

Remote Apache Ignite cluster connection failures

I can successfully join and leave a single-node Apache Ignite 2.8.1 topology running as a Docker container on my local Docker server.
Running the exact same program against a remote Docker server, I can see my program joining the cluster topology, but before the connection completes I get the following connection error:
SEVERE: Failed to send message to remote node [node=TcpDiscoveryNode [id=a239f009-bddd-4a06-845f-abb304850849, consistentId=127.0.0.1,172.17.0.13:42002, addrs=ArrayList [127.0.0.1, 172.17.0.13], sockAddrs=HashSet [/172.17.0.13:42002, /127.0.0.1:42002], discPort=42002, order=1, intOrder=1, lastExchangeTime=1605015503009, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null, partsSizes=null, partHistCntrs=null, err=null, client=true, exchangeStartTime=106333448635300, finishMsg=null, super=GridDhtPartitionsAbstractMessage [exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=dc9a3700-5377-4095-ac2b-31a2cea3d9a5, consistentId=dc9a3700-5377-4095-ac2b-31a2cea3d9a5, addrs=ArrayList [0:0:0:0:0:0:0:1, 10.91.7.30, 127.0.0.1, 192.168.1.81, 192.168.38.1], sockAddrs=HashSet [host.docker.internal/192.168.1.81:0, /0:0:0:0:0:0:0:1:0, GBLG7Y7GH2.mshome.net/192.168.38.1:0, /127.0.0.1:0, GBLG7Y7GH2.enterprisenet.org/10.91.7.30:0], discPort=0, order=2, intOrder=0, lastExchangeTime=1605015498538, loc=true, ver=2.8.1#20200521-sha1:86422096, isClient=true], topVer=2, nodeId8=dc9a3700, msg=null, type=NODE_JOINED, tstamp=1605015505481], nodeId=dc9a3700, evt=NODE_JOINED], lastVer=GridCacheVersion [topVer=0, order=1605015496511, nodeOrder=0], super=GridCacheMessage [msgId=1, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], err=null, skipPrepare=false]]]]]
class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a239f009-bddd-4a06-845f-abb304850849, addrs=[/172.17.0.13:42003, /127.0.0.1:42003]]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3738)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3458)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3198)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3078)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2918)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2877)
at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:2035)
at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:2132)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.sendLocalPartitions(GridDhtPartitionsExchangeFuture.java:2020)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.clientOnlyExchange(GridDhtPartitionsExchangeFuture.java:1436)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:903)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3214)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3063)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a239f009-bddd-4a06-845f-abb304850849, addrs=[/172.17.0.13:42003, /127.0.0.1:42003]]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3740)
... 15 more
Caused by: java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:129)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3584)
... 15 more
Caused by: java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:129)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3584)
... 15 more
In my view the problem relates to the client connection settings, so I tried increasing the client discovery SPI "joinTimeout", "networkTimeout" and "socketTimeout" settings, as well as the communication SPI "connectionTimeout" and "socketWriteTimeout" settings, but without success.
You have to set up an AddressResolver for the node running inside the remote Docker container.
Have a look at: https://www.gridgain.com/docs/latest/installation-guide/aws/manual-install-on-ec2#connecting-a-client-node
If you're using Spring configuration, then your config should look something like this:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="addressResolver">
        <bean class="org.apache.ignite.configuration.BasicAddressResolver">
            <constructor-arg>
                <map>
                    <entry key="172.31.59.27" value="3.93.186.198"/>
                </map>
            </constructor-arg>
        </bean>
    </property>
    <!-- Other properties. -->
    <!-- Discovery configuration. -->
</bean>
Here 172.31.59.27 is the internal IP and 3.93.186.198 is the external IP that you're connecting to.
Did you open ports 47500 (discovery) and 47100 (communication) both ways between your Docker host and the remote node?
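One way to apply such a mapping from an Ignite.NET node is to load the Spring XML via SpringConfigUrl; a minimal sketch (the file path is hypothetical, and the XML file is assumed to contain the IgniteConfiguration bean shown above):

using Apache.Ignite.Core;

class Program
{
    static void Main()
    {
        var cfg = new IgniteConfiguration
        {
            // Hypothetical path; the referenced XML declares the
            // IgniteConfiguration bean with the addressResolver property.
            SpringConfigUrl = "config/address-resolver.xml"
        };

        using (var ignite = Ignition.Start(cfg))
        {
            // The node joins using the inner-to-external address mappings.
        }
    }
}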

Apache Ignite: getting the error: Getting affinity for topology version earlier than affinity is calculated

I'm running an Apache Ignite .NET 2.7 cluster in a Linux environment in a Kubernetes cluster. The Ignite cluster consists of 5 Ignite nodes running three microservices (two instances of the 1st service, two of the 2nd, and one of the 3rd). Two of the microservices deploy a couple of Ignite services which call each other.
The cluster starts up successfully, discovery works fine and all the nodes are added to the cluster. But out of the blue, both instances of one service (2 nodes) fail with the following error:
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated [locNode=TcpDiscoveryNode [id=76308a3b-221a-4307-b181-bd4e66d82683, addrs=[10.0.0.62, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, product-service-deployment-7dd5496d58-l426m/10.0.0.62:47500], discPort=47500, order=8, intOrder=6, lastExchangeTime=1560283011887, loc=true, ver=2.7.0#20181130-sha1:256ae401, isClient=false], grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=17, minorTopVer=0], head=AffinityTopologyVersion [topVer=18, minorTopVer=0], history=[AffinityTopologyVersion [topVer=9, minorTopVer=0], AffinityTopologyVersion [topVer=11, minorTopVer=0], AffinityTopologyVersion [topVer=11, minorTopVer=1], AffinityTopologyVersion [topVer=12, minorTopVer=0], AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion [topVer=16, minorTopVer=0], AffinityTopologyVersion [topVer=18, minorTopVer=0]]]
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:712)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:612)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodesByPartition(GridCacheAffinityManager.java:226)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByPartition(GridCacheAffinityManager.java:266)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:257)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:281)
at org.apache.ignite.internal.processors.service.GridServiceProcessor$TopologyListener$1.run0(GridServiceProcessor.java:1877)
at org.apache.ignite.internal.processors.service.GridServiceProcessor$DepRunnable.run(GridServiceProcessor.java:2064)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This causes the other service to fail because it depends on the first service:
Unhandled Exception: Apache.Ignite.Core.Services.ServiceInvocationException: Proxy method invocation failed with an exception. Examine InnerException for details. ---> Apache.Ignite.Core.Common.IgniteException: Failed to find deployed service: ProductService ---> Apache.Ignite.Core.Common.JavaException: class org.apache.ignite.IgniteException: Failed to find deployed service: ProductService
Since the second service is being restarted by Kubernetes, the first service reports constant topology changes:
[19:57:14] Topology snapshot [ver=20, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:15] Topology snapshot [ver=21, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:17] Topology snapshot [ver=22, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:49] Topology snapshot [ver=23, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:50] Topology snapshot [ver=24, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:56] Topology snapshot [ver=25, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:58] Topology snapshot [ver=26, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:58:41] Topology snapshot [ver=27, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
Immediately before I identified this problem, I ran a minor reconfiguration of the Kubernetes cluster which did not cause pod restarts. I'm not sure whether it could be the cause of the condition in question.
Is this a known problem with a solution? What should I check (particularly in the logs) that could shed light on this situation?
Thank you!
The "Getting affinity for topology version earlier than affinity is calculated" error is caused by a known issue. Here is the JIRA ticket for it: https://issues.apache.org/jira/browse/IGNITE-8098
No negative effects from this issue have been noticed so far, so the pod failures are probably caused by something else.
Ignite 2.8 won't have this issue, since the implementation of the service processor was reworked completely. Here is the related IEP: https://cwiki.apache.org/confluence/display/IGNITE/IEP-17%3A+Oil+Change+in+Service+Grid

Apache Ignite node not able to join the cluster

I'm using the apacheignite:2.5.0 Docker image deployed on 2 different EC2 instances, using the static IP finder config below; one of the nodes is unable to join the cluster. I have attached the logs as well: below you can see it accepting the connection and then disconnecting. I ran the Docker container with --net=host so the container attaches all ports to the host machine, and all ports are opened in the security group.
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/util
           http://www.springframework.org/schema/util/spring-util.xsd">
    <bean abstract="false" id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>34.241.10.9:47500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.18.0.1, 172.31.29.3], sockAddrs=[/172.31.29.3:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.18.0.1:47500], discPort=47500, order=312, intOrder=157, lastExchangeTime=1529067545288, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Topology snapshot [ver=312, servers=2, clients=0, CPUs=6, offheap=3.8GB, heap=2.0GB]
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Data Regions Configured:
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=710.0 MiB, persistenceEnabled=false]
[12:59:25,309][INFO][exchange-worker-#38][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], crd=true, evt=NODE_JOINED, evtNode=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, customEvt=null, allowMerge=true]
[12:59:25,309][WARNING][disco-event-worker-#37][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.18.0.1, 172.31.29.3], sockAddrs=[/172.31.29.3:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.18.0.1:47500], discPort=47500, order=312, intOrder=157, lastExchangeTime=1529067545288, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:59:25,310][INFO][exchange-worker-#38][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], waitTime=0ms, futInfo=NA]
[12:59:25,310][INFO][exchange-worker-#38][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], crd=true]
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] Topology snapshot [ver=313, servers=1, clients=0, CPUs=2, offheap=0.69GB, heap=1.0GB]
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] Data Regions Configured:
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=710.0 MiB, persistenceEnabled=false]
[12:59:25,310][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=312, minorTopVer=0]]
[12:59:25,311][INFO][disco-event-worker-#37][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=312, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=313, minorTopVer=0], evt=NODE_FAILED, evtNode=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, evtNodeClient=false]
[12:59:25,311][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=313, minorTopVer=0]]
[12:59:25,311][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=313, minorTopVer=0], err=null]
[12:59:25,312][INFO][exchange-worker-#38][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=313, minorTopVer=0], evt=NODE_JOINED, node=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15]
[12:59:25,315][INFO][grid-timeout-worker-#23][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=225f750c, uptime=01:42:00.504]
^-- H/N/C [hosts=1, nodes=1, CPUs=2]
^-- CPU [cur=0.17%, avg=0.4%, GC=0%]
^-- PageMemory [pages=200]
^-- Heap [used=73MB, free=92.47%, comm=981MB]
^-- Non heap [used=53MB, free=96.47%, comm=55MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=6, qSize=0]
^-- System thread pool [active=0, idle=8, qSize=0]
[12:59:25,320][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/34.241.7.9, rmtPort=53627]
[12:59:25,320][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/34.241.7.9, rmtPort=53627]
[12:59:25,320][INFO][tcp-disco-sock-reader-#628][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/34.241.7.9:53627, rmtPort=53627]
[12:59:25,325][INFO][tcp-disco-sock-reader-#628][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/34.241.7.9:53627, rmtPort=53627
[12:59:30,332][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/34.241.7.9, rmtPort=50418]
[12:59:30,332][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/34.241.7.9, rmtPort=50418]
[12:59:30,332][INFO][tcp-disco-sock-reader-#629][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/34.241.7.9:50418, rmtPort=50418]
[12:59:30,334][INFO][tcp-disco-sock-reader-#629][TcpDiscoverySpi] Finished
2nd Ignite node's logs:
[12:13:12,850][INFO][main][TcpCommunicationSpi] Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
[12:13:12,869][WARNING][main][TcpCommunicationSpi] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[12:13:12,888][WARNING][main][NoopCheckpointSpi] Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[12:13:12,918][WARNING][main][GridCollisionManager] Collision resolution is disabled (all jobs will be activated upon arrival).
[12:13:12,919][INFO][main][IgniteKernal] Security status [authentication=off, tls/ssl=off]
[12:13:13,275][INFO][main][ClientListenerProcessor] Client connector processor has started on TCP port 10800
[12:13:13,328][INFO][main][GridTcpRestProtocol] Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11211]
[12:13:13,369][INFO][main][IgniteKernal] Non-loopback local IPs: 172.17.0.1, 172.18.0.1, 172.31.29.3, fe80:0:0:0:10f0:92ff:fea1:d09f%vethee2519f, fe80:0:0:0:42:19ff:fe73:ee80%docker_gwbridge, fe80:0:0:0:42:e6ff:fe14:144a%docker0, fe80:0:0:0:4b3:6ff:fe01:7ee0%eth0, fe80:0:0:0:64f4:8bff:fe83:7e97%vethdae9948, fe80:0:0:0:9474:a1ff:fe6b:3368%vethcb2500f
[12:13:13,370][INFO][main][IgniteKernal] Enabled local MACs: 02421973EE80, 0242E614144A, 06B306017EE0, 12F092A1D09F, 66F48B837E97, 9674A16B3368
[12:13:13,429][INFO][main][TcpDiscoverySpi] Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15]
[12:13:18,555][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:18:20,925][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:23:22,710][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:28:23,988][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:33:25,004][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:38:25,815][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:43:26,831][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:48:27,916][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
If you are using the same config file to start both nodes, then try using localPortRange in the discovery SPI.
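For reference, a sketch of that configuration using the Ignite .NET API (the property names match the Java/XML ones; the port values are the defaults, and the address reuses the one from the question's config, extended with a port range so a peer bound to any port in the range can be found):

using Apache.Ignite.Core;
using Apache.Ignite.Core.Discovery.Tcp;
using Apache.Ignite.Core.Discovery.Tcp.Static;

class Program
{
    static void Main()
    {
        var cfg = new IgniteConfiguration
        {
            DiscoverySpi = new TcpDiscoverySpi
            {
                LocalPort = 47500,   // first discovery port to try
                LocalPortRange = 10, // fall back to 47501..47509 if taken
                IpFinder = new TcpDiscoveryStaticIpFinder
                {
                    // Advertise the whole range on the remote side too.
                    Endpoints = new[] { "34.241.10.9:47500..47509" }
                }
            }
        };

        using (var ignite = Ignition.Start(cfg))
        {
        }
    }
}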

Apache Ignite node not able to join grid

I'm using a static IP finder configuration with 2 Ignite Docker containers installed on 2 different EC2 instances, but the nodes are not able to join each other. Below are the logs:
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=46, servers=2, clients=0, CPUs=6, offheap=3.8GB, heap=2.0GB]
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] Data Regions Configured:
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=3.1 GiB, persistenceEnabled=false]
[07:40:10,697][INFO][exchange-worker-#42][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], crd=true, evt=NODE_JOINED, evtNode=05bece82-1950-4fc0-a58e-c062ad4e9b18, customEvt=null, allowMerge=true]
[07:40:10,697][INFO][exchange-worker-#42][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], waitTime=0ms, futInfo=NA]
[07:40:10,697][INFO][exchange-worker-#42][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], crd=true]
[07:40:10,697][WARNING][disco-event-worker-#41][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=05bece82-1950-4fc0-a58e-c062ad4e9b18, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.19.0.1, 192.168.1.202], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.19.0.1:47500, /192.168.1.202:47500], discPort=47500, order=46, intOrder=24, lastExchangeTime=1529048390669, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=47, servers=1, clients=0, CPUs=4, offheap=3.1GB, heap=1.0GB]
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] Data Regions Configured:
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=3.1 GiB, persistenceEnabled=false]
[07:40:10,699][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=46, minorTopVer=0]]
[07:40:10,699][INFO][disco-event-worker-#41][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=46, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=47, minorTopVer=0], evt=NODE_FAILED, evtNode=05bece82-1950-4fc0-a58e-c062ad4e9b18, evtNodeClient=false]
[07:40:10,699][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0]]
[07:40:10,700][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0], err=null]
[07:40:10,701][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/53.247.167.223, rmtPort=50787]
[07:40:10,701][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/53.247.167.223, rmtPort=50787]
[07:40:10,701][INFO][tcp-disco-sock-reader-#133][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/53.247.167.223:50787, rmtPort=50787]
[07:40:10,702][INFO][exchange-worker-#42][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=47, minorTopVer=0], evt=NODE_JOINED, node=05bece82-1950-4fc0-a58e-c062ad4e9b18]
[07:40:10,704][INFO][tcp-disco-sock-reader-#133][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/53.247.167.223:50787, rmtPort=50787
You can pass the 1st container's host name to the Ignite node in the 2nd container via a system environment variable in your Ignite configuration:
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
    <property name="addresses">
        <list>
            <value>#{systemEnvironment['IGNITE_HOST'] ?: '127.0.0.1'}:47500..47509</value>
        </list>
    </property>
</bean>
An example docker-compose.yml for 2 communicating Ignite services:
version: "3"
services:
  ignite:
    image: image_name1
    networks:
      - net
  face:
    image: image_name2
    depends_on:
      - ignite
    networks:
      - net
    environment:
      IGNITE_HOST: 'ignite'
networks:
  net: # the custom network must be declared at the top level
The Ignite node in 'face' can then connect to the Ignite node in 'ignite' using the address ignite:47500..47509.
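The same environment-variable fallback can be expressed with the Ignite .NET API (a sketch mirroring the SpEL expression above; IGNITE_HOST is the variable name from the compose file):

using System;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Discovery.Tcp;
using Apache.Ignite.Core.Discovery.Tcp.Static;

class Program
{
    static void Main()
    {
        // Mirrors #{systemEnvironment['IGNITE_HOST'] ?: '127.0.0.1'}:
        // read the peer host from the environment, default to loopback.
        var host = Environment.GetEnvironmentVariable("IGNITE_HOST") ?? "127.0.0.1";

        var cfg = new IgniteConfiguration
        {
            DiscoverySpi = new TcpDiscoverySpi
            {
                IpFinder = new TcpDiscoveryStaticIpFinder
                {
                    Endpoints = new[] { host + ":47500..47509" }
                }
            }
        };

        using (var ignite = Ignition.Start(cfg))
        {
            Console.ReadLine();
        }
    }
}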
Try using internal IP addresses, as suggested in this answer: http://apache-ignite-users.70518.x6.nabble.com/Ignite-docker-container-not-able-to-join-in-cluster-td22080.html

GridGain network connection: Is it possible to forward a node via SSH?

I would like to SSH into a remote machine running a GridGain instance and connect to it from a local GridGain instance. Can this be done?
How is the GridGain network connection made? As far as I could see, the node spins up and listens on the first available port in the 47100-47200 range. But it opens some more ports too.
It seems not to be sufficient to just forward, e.g., port 47100 on the remote machine (the remote machine's GridGain port) to local port 47100. Probably the communication is not just client-server but symmetrical, with the remote node trying to connect back to my home node?
Is there documentation on the network protocol?
I tried symmetrically forwarding the
GridTcpCommunicationSpi.DFLT_PORTs (47100+) and
GridTcpDiscoverySpi.DFLT_PORTs (47500+)
ports.
The nodes are able to connect. On the local node I first get this warning:
WARN GridTcpCommunicationSpi - Connect timed out (consider increasing 'connTimeout' configuration property) [addr=/10.240.136.167:47100]
WARN GridTcpDiscoverySpi - Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout. Current timeout: 5000.
WARN GridDhtPreloader - <gg-utility-sys-cache> Failed to wait for initial partition map exchange. Possible reasons are:
^-- Transactions in deadlock.
^-- Long running transactions (ignore if this is the case).
^-- Unreleased explicit locks.
WARN GridTcpDiscoverySpi - Timed out waiting for message to be read (most probably, the reason is in long GC pauses on remote node. Current timeout: 5000.
This is a timeout when it somehow tries to connect to 10.240.136.167:47100, which is the remote machine's local IP, so the connection is obviously impossible.
But things look promising at first, as I get the following:
INFO GridDiscoveryManager - Topology snapshot [ver=2, nodes=2, CPUs=6, heap=2.7GB]
On executing the following broadcast test:
grid.compute().broadcast(new GridRunnable() {
    @Override
    public void run() {
        System.out.println("hello!");
    }
});
I get this fatal error on the remote machine, whatever it may be:
[SEVERE][gridgain-#9%pub-null%][GridJobProcessor] Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$Test, taskClsName=at$
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[19:58:02,237][SEVERE][gridgain-#11%pub-null%][GridJobProcessor] Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$1, taskClsName=at.a$
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
class org.gridgain.grid.GridDeploymentException: Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$1, taskClsName=at.ac.ait.is.infrase$
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1107)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
On the client side I don't see anything but:
INFO GridDeploymentLocalStore - Class locally deployed: class nix.GoogleGridRun$1
hello!
When I try to trigger the broadcast again via the debugger, I get the following on the local machine and the same error message as before on the remote machine:
ERROR GridTaskWorker - Failed to obtain remote job result policy for result from GridComputeTask.result(..) method (will fail the whole task): GridJobResultImpl [job=o.g.g.kernal.processors.closure.GridClosureProcessor$10#7e89183d, sib=GridJobSiblingImpl [sesId=4c17983b841-43f8b9fa-87ae-4a20-99a1-8d36f5eb74a4, jobId=0d17983b841-ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, nodeId=ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, isJobDone=false], jobCtx=GridJobContextImpl [jobId=0d17983b841-ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, attrs={}], node=GridTcpDiscoveryNode [id=ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, addrs=[10.240.136.167, 127.0.0.1], sockAddrs=[/10.240.136.167:47500, /10.240.136.167:47500, /127.0.0.1:47500], discPort=47500, order=1, loc=false, ver=6.5.0#20140925-sha1:6dc3d773], ex=class o.g.g.GridDeploymentException: Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$Test, taskClsName=nix.GoogleGridRun$Test, codeVer=0, clsLdrId=eb17983b841-43f8b9fa-87ae-4a20-99a1-8d36f5eb74a4, seqNum=1411761402302, depMode=SHARED, dep=null]
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
, hasRes=true, isCancelled=false, isOccupied=true]
class org.gridgain.grid.GridException: Remote job threw user exception (override or implement GridComputeTask.result(..) method if you would like to have automatic failover for this exception).
at org.gridgain.grid.compute.GridComputeTaskAdapter.result(GridComputeTaskAdapter.java:109)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker$3.apply(GridTaskWorker.java:819)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker$3.apply(GridTaskWorker.java:812)
at org.gridgain.grid.util.GridUtils.wrapThreadLoader(GridUtils.java:6093)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.result(GridTaskWorker.java:812)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:708)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:906)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:1138)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: class org.gridgain.grid.GridDeploymentException: Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$Test, taskClsName=nix.GoogleGridRun$Test, codeVer=0, clsLdrId=eb17983b841-43f8b9fa-87ae-4a20-99a1-8d36f5eb74a4, seqNum=1411761402302, depMode=SHARED, dep=null]
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1107)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
On the local host side I have connections between the virtual and real ports (VERBUNDEN is ESTABLISHED in the German netstat locale):
tcp6 0 0 127.0.0.1:47100 127.0.0.1:38272 VERBUNDEN 12280/java
tcp6 0 0 127.0.0.1:38272 127.0.0.1:47100 VERBUNDEN 12280/java
And some more to and from the SSH client (also Java):
tcp6 45832 0 78.101.12.107:47101 146.148.119.62:51867 VERBUNDEN 12280/java
tcp6 231 0 78.101.12.107:47501 146.148.119.62:46219 CLOSE_WAIT 12280/java
tcp6 48 0 78.101.12.107:37129 146.148.119.62:22 VERBUNDEN 12280/java
tcp6 1 0 78.101.12.107:47501 146.148.119.62:44391 CLOSE_WAIT 12280/java
78.101.12.107 = local ip
146.148.119.62 = remote ip
Looking at netstat on a successful local 2-node grid, I see the following connections being made:
tcp6 0 0 ::1:47501 ::1:43143 VERBUNDEN 10218/java
tcp6 0 0 ::1:47500 ::1:34708 VERBUNDEN 9496/java
tcp6 0 0 ::1:34708 ::1:47500 VERBUNDEN 10218/java
tcp6 0 0 ::1:43143 ::1:47501 VERBUNDEN 9496/java
These are between the GridTcpCommunicationSpi.DFLT_PORTs and GridTcpDiscoverySpi.DFLT_PORTs, so forwarding these should perhaps be enough.
Any ideas on what could be wrong?
The home node should be reachable from the cluster as well. You have 2 options:
Set up a VPN.
Implement and configure GridAddressResolver for all nodes, which will translate their local addresses into external addresses. This requires setting up port forwarding in your home network.