Ignite topology not stable between nodes through LAN switch

I'm setting up an Apache Ignite cluster and have difficulty keeping the topology alive when more than two nodes are connected through a LAN switch.
There are many warnings and problems reported in the log, but what are the correct first steps to isolate the problem? Ping works fine in both directions. After some 30 s to 1 min the connection works, but the nodes often lose each other again. Sometimes a third node trying to connect causes the whole cluster to fail.
[20:41:34,761][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[20:41:34,761][INFO][tcp-disco-sock-reader-#28][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.10.161:34361, rmtPort=34361
[20:41:34,762][WARNING][disco-event-worker-#161][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=dd44ea86-5302-47a0-b3c0-86acdcf7e771, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.10.162], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, node_2/192.168.10.162:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1524656494760, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,764][INFO][tcp-disco-sock-reader-#14][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.10.1:55641, rmtPort=55641
[20:41:34,766][WARNING][disco-event-worker-#161][GridDiscoveryManager] Stopping local node according to configured segmentation policy.
[20:41:34,767][WARNING][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=379eb246-e111-4510-a3f6-09554667d769, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.10.161], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.161:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1524656073909, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,768][INFO][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=6, servers=2, clients=0, CPUs=60, heap=2.0GB]
[20:41:34,770][WARNING][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1524656176508, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,770][INFO][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=7, servers=1, clients=0, CPUs=56, heap=1.0GB]
[20:41:34,771][INFO][Thread-3][GridTcpRestProtocol] Command protocol successfully stopped: TCP binary
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][Thread-3][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=5, minorTopVer=0]]
[20:41:34,774][INFO][Thread-3][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][disco-event-worker-#161][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=5, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=379eb246-e111-4510-a3f6-09554667d769, evtNodeClient=false]
[20:41:34,774][INFO][disco-event-worker-#161][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=5, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=dd64661b-0679-4a14-9440-d876e5c35bd5, evtNodeClient=false]
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0]]
[20:41:34,787][INFO][Thread-3][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache]
[20:41:34,803][INFO][Thread-3][IgniteKernal]
>>> +---------------------------------------------------------------------------------+
>>> Ignite ver. 2.3.0#20171028-sha1:8add7fd5b501b40658096cdde48af9e948aa8150 stopped OK
>>> +---------------------------------------------------------------------------------+
>>> Grid uptime: 00:07:08.412
[root#node_2 apache-ignite-fabric-2.3.0-bin]# packet_write_wait: Connection to 192.168.10.162 port 22: Broken pipe
On one of the other nodes something like this is shown after some time:
[22:45:54,026][SEVERE][grid-nio-worker-tcp-comm-6-#127][TcpCommunicationSpi] Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=6, bytesRcvd=1578, bytesSent=5266, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-6, igniteInstanceName=null, finished=false, hashCode=733187042, interrupted=false, runner=grid-nio-worker-tcp-comm-6-#127]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=4, resendCnt=0, rcvCnt=4, sentCnt=4, reserved=true, lastAck=4, nodeLeft=false, node=TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1524656494855, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=4, resendCnt=0, rcvCnt=4, sentCnt=4, reserved=true, lastAck=4, nodeLeft=false, node=TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1524656494855, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/192.168.10.161:47100, rmtAddr=/192.168.10.1:47884, createTime=1524656504308, closeTime=0, bytesSent=5266, bytesRcvd=1578, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1524663359458, 
lastSndTime=1524656672249, lastRcvTime=1524663359458, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser#32244b13, directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1233)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2272)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2048)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1717)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
[22:45:54,027][WARNING][grid-nio-worker-tcp-comm-6-#127][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Connection reset by peer]
[22:46:41,002][INFO][grid-timeout-worker-#119][IgniteKernal]
Any idea where I should start looking for the cause of the problem?
Thanks!

As suggested in the warning
[20:41:34,761][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
the reason is likely a network issue. Pings may work fine (although I'd check the failure rate over a long enough interval, such as 10-15 minutes), but also test a long-running TCP connection, for example with netcat.
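If netcat is not available on the nodes, the same kind of check can be sketched in plain Java. This is a minimal sketch and not part of Ignite: the class TcpHeartbeatProbe, its method names and the timing values are all hypothetical. One side echoes bytes, the other keeps a single TCP connection open and sends a one-byte heartbeat at a fixed interval, stopping at the first heartbeat that is not echoed in time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, not part of Ignite: heartbeats over one long-lived TCP connection.
final class TcpHeartbeatProbe {

    /** Accepts one connection and echoes every byte back until the peer closes it. */
    static void serveOnce(ServerSocket server) throws IOException {
        try (Socket s = server.accept()) {
            InputStream in = s.getInputStream();
            OutputStream out = s.getOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
                out.flush();
            }
        }
    }

    /**
     * Sends up to {@code beats} one-byte heartbeats over a single connection and
     * returns how many were echoed before the first timeout or I/O error.
     */
    static int probe(String host, int port, int beats, long intervalMs, int timeoutMs)
            throws IOException, InterruptedException {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            s.setSoTimeout(timeoutMs); // bound each wait for an echo
            s.setTcpNoDelay(true);     // send each heartbeat immediately
            InputStream in = s.getInputStream();
            OutputStream out = s.getOutputStream();
            for (int i = 0; i < beats; i++) {
                long start = System.nanoTime();
                try {
                    out.write(1);
                    out.flush();
                    if (in.read() == -1) {
                        throw new IOException("connection closed by peer");
                    }
                } catch (IOException e) {
                    System.out.println("heartbeat " + i + " failed: " + e);
                    return i; // stop here: a late echo would misalign later heartbeats
                }
                long rttMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("heartbeat " + i + " rtt=" + rttMs + " ms");
                Thread.sleep(intervalMs);
            }
            return beats;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2 && args[0].equals("server")) {
            try (ServerSocket server = new ServerSocket(Integer.parseInt(args[1]))) {
                serveOnce(server);
            }
        } else if (args.length == 3 && args[0].equals("client")) {
            // 900 heartbeats, 1 s apart, is roughly 15 minutes.
            int ok = probe(args[1], Integer.parseInt(args[2]), 900, 1000L, 5000);
            System.out.println(ok + " of 900 heartbeats echoed");
        } else {
            System.out.println("usage: server <port> | client <host> <port>");
        }
    }
}
```

Run the server mode on one node and the client mode on another, using a port that Ignite is not already bound to. A run that stops early, or shows round-trip times spiking into seconds, points at the network rather than at Ignite.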
Another possible reason is high load on the nodes. For example, if a node enters a stop-the-world GC pause and cannot respond for a long time, it may also be kicked out of the cluster.
To make the cluster more tolerant of short-lived network and responsiveness issues, try increasing the IgniteConfiguration.failureDetectionTimeout setting.
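As a rough illustration, the timeout can be raised programmatically before the node starts. This sketch assumes ignite-core is on the classpath. The class name is hypothetical, and 30 seconds is an arbitrary example value, not a recommendation (the server-side default is 10 seconds).

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical startup class; only the setFailureDetectionTimeout call matters here.
public class TolerantNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Time in milliseconds before an unresponsive node is considered failed.
        // 30_000 is an illustrative value; tune it to the pauses you actually observe.
        cfg.setFailureDetectionTimeout(30_000L);

        Ignition.start(cfg);
    }
}
```

A larger value makes the cluster slower to notice nodes that have really died, so it trades failover speed for stability.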

Related

Remote Apache Ignite cluster connection failures

I can successfully join and leave a single-node Apache Ignite 2.8.1 topology running as a Docker container on my local Docker server.
Running the exact same program against a remote Docker server, I can see my program joining the cluster topology, but before the connection completes I get the following connection error:
SEVERE: Failed to send message to remote node [node=TcpDiscoveryNode [id=a239f009-bddd-4a06-845f-abb304850849, consistentId=127.0.0.1,172.17.0.13:42002, addrs=ArrayList [127.0.0.1, 172.17.0.13], sockAddrs=HashSet [/172.17.0.13:42002, /127.0.0.1:42002], discPort=42002, order=1, intOrder=1, lastExchangeTime=1605015503009, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null, partsSizes=null, partHistCntrs=null, err=null, client=true, exchangeStartTime=106333448635300, finishMsg=null, super=GridDhtPartitionsAbstractMessage [exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=dc9a3700-5377-4095-ac2b-31a2cea3d9a5, consistentId=dc9a3700-5377-4095-ac2b-31a2cea3d9a5, addrs=ArrayList [0:0:0:0:0:0:0:1, 10.91.7.30, 127.0.0.1, 192.168.1.81, 192.168.38.1], sockAddrs=HashSet [host.docker.internal/192.168.1.81:0, /0:0:0:0:0:0:0:1:0, GBLG7Y7GH2.mshome.net/192.168.38.1:0, /127.0.0.1:0, GBLG7Y7GH2.enterprisenet.org/10.91.7.30:0], discPort=0, order=2, intOrder=0, lastExchangeTime=1605015498538, loc=true, ver=2.8.1#20200521-sha1:86422096, isClient=true], topVer=2, nodeId8=dc9a3700, msg=null, type=NODE_JOINED, tstamp=1605015505481], nodeId=dc9a3700, evt=NODE_JOINED], lastVer=GridCacheVersion [topVer=0, order=1605015496511, nodeOrder=0], super=GridCacheMessage [msgId=1, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], err=null, skipPrepare=false]]]]]
class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a239f009-bddd-4a06-845f-abb304850849, addrs=[/172.17.0.13:42003, /127.0.0.1:42003]]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3738)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3458)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3198)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3078)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2918)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2877)
at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:2035)
at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:2132)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.sendLocalPartitions(GridDhtPartitionsExchangeFuture.java:2020)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.clientOnlyExchange(GridDhtPartitionsExchangeFuture.java:1436)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:903)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3214)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3063)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a239f009-bddd-4a06-845f-abb304850849, addrs=[/172.17.0.13:42003, /127.0.0.1:42003]]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3740)
... 15 more
Caused by: java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:129)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3584)
... 15 more
Caused by: java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:129)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3584)
... 15 more
In my view the problem relates to the client connection settings, so I tried to increase the client discovery SPI "joinTimeout", "networkTimeout" and "socketTimeout" settings as well as the "connectionTimeout" and "socketWriteTimeout" settings but without success.
You have to set up an AddressResolver for the node running inside the remote Docker container.
Have a look at: https://www.gridgain.com/docs/latest/installation-guide/aws/manual-install-on-ec2#connecting-a-client-node
If you're using Spring configuration, then your config should look something like this:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="addressResolver">
        <bean class="org.apache.ignite.configuration.BasicAddressResolver">
            <constructor-arg>
                <map>
                    <entry key="172.31.59.27" value="3.93.186.198"/>
                </map>
            </constructor-arg>
        </bean>
    </property>
    <!-- other properties -->
    <!-- Discovery configuration -->
</bean>
Here 172.31.59.27 is the internal IP and 3.93.186.198 is the external IP that you connect to.
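If the node is configured in code rather than through Spring, the same mapping can be sketched in Java. This assumes ignite-core is on the classpath. The class and method names are hypothetical, and the two addresses are the example values from the Spring configuration, not ones taken from your setup.

```java
import java.net.UnknownHostException;
import java.util.Collections;

import org.apache.ignite.configuration.BasicAddressResolver;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical helper that builds a server configuration with an address mapping.
final class AddressResolverConfig {
    static IgniteConfiguration serverConfig() throws UnknownHostException {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Map the internal address (key) to the externally reachable one (value),
        // so that remote clients are told an address they can actually connect to.
        cfg.setAddressResolver(new BasicAddressResolver(
            Collections.singletonMap("172.31.59.27", "3.93.186.198")));

        return cfg;
    }
}
```

The resolver belongs on the server node inside the container, since that node is the one advertising addresses the client cannot reach.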
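Did you open the discovery and communication ports (47500 and 47100 by default; 42002 and 42003 according to your log) in both directions between your Docker container and the remote node?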

Apache Ignite: getting the error: Getting affinity for topology version earlier than affinity is calculated

I'm running an Apache Ignite.NET 2.7 cluster on Linux in a Kubernetes cluster. The Ignite cluster consists of 5 Ignite nodes running three microservices (two instances of the first service, two of the second and one of the third). Two of the microservices deploy a couple of Ignite services which call each other.
The cluster starts up successfully, discovery works fine and all the nodes are added to the cluster. But all of a sudden, both instances of one service (2 nodes) fail with the following error:
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated [locNode=TcpDiscoveryNode [id=76308a3b-221a-4307-b181-bd4e66d82683, addrs=[10.0.0.62, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, product-service-deployment-7dd5496d58-l426m/10.0.0.62:47500], discPort=47500, order=8, intOrder=6, lastExchangeTime=1560283011887, loc=true, ver=2.7.0#20181130-sha1:256ae401, isClient=false], grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=17, minorTopVer=0], head=AffinityTopologyVersion [topVer=18, minorTopVer=0], history=[AffinityTopologyVersion [topVer=9, minorTopVer=0], AffinityTopologyVersion [topVer=11, minorTopVer=0], AffinityTopologyVersion [topVer=11, minorTopVer=1], AffinityTopologyVersion [topVer=12, minorTopVer=0], AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion [topVer=16, minorTopVer=0], AffinityTopologyVersion [topVer=18, minorTopVer=0]]]
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:712)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:612)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodesByPartition(GridCacheAffinityManager.java:226)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByPartition(GridCacheAffinityManager.java:266)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:257)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:281)
at org.apache.ignite.internal.processors.service.GridServiceProcessor$TopologyListener$1.run0(GridServiceProcessor.java:1877)
at org.apache.ignite.internal.processors.service.GridServiceProcessor$DepRunnable.run(GridServiceProcessor.java:2064)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This causes the other service to fail because it depends on the first service:
Unhandled Exception: Apache.Ignite.Core.Services.ServiceInvocationException: Proxy method invocation failed with an exception. Examine InnerException for details. ---> Apache.Ignite.Core.Common.IgniteException: Failed to find deployed service: ProductService ---> Apache.Ignite.Core.Common.JavaException: class org.apache.ignite.IgniteException: Failed to find deployed service: ProductService
Since the second service is being restarted by Kubernetes, the first service reports constant topology changes:
[19:57:14] Topology snapshot [ver=20, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:15] Topology snapshot [ver=21, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:17] Topology snapshot [ver=22, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:49] Topology snapshot [ver=23, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:50] Topology snapshot [ver=24, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:56] Topology snapshot [ver=25, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:58] Topology snapshot [ver=26, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:58:41] Topology snapshot [ver=27, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
Immediately before I noticed this problem, I ran a minor reconfiguration of the Kubernetes cluster which did not cause pod restarts. I'm not sure whether it could be the cause of the condition in question.
Is this a known problem that has a solution? What should I check (particularly in the logs) that could shed light on this situation?
Thank you!
The "Getting affinity for topology version earlier than affinity is calculated" error is caused by a known issue. Here is the JIRA ticket for it: https://issues.apache.org/jira/browse/IGNITE-8098
No negative effects from this issue have been noticed so far, so the pod failures are probably caused by something else.
Ignite 2.8 will not have this issue, since the service processor implementation was completely reworked. Here is the related IEP: https://cwiki.apache.org/confluence/display/IGNITE/IEP-17%3A+Oil+Change+in+Service+Grid

apache ignite node not able to join in the cluster

I'm using the apacheignite:2.5.0 Docker image deployed on 2 different EC2 instances with a static IP finder. The config file is below. One of the nodes is unable to join the cluster.
I have also attached the logs below: the first node accepts the connection and then disconnects. I ran the Docker container with --net=host, so the container attaches all ports to the host machine, and all ports are open in the security group.
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/util
           http://www.springframework.org/schema/util/spring-util.xsd">
    <bean abstract="false" id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>34.241.10.9:47500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.18.0.1, 172.31.29.3], sockAddrs=[/172.31.29.3:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.18.0.1:47500], discPort=47500, order=312, intOrder=157, lastExchangeTime=1529067545288, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Topology snapshot [ver=312, servers=2, clients=0, CPUs=6, offheap=3.8GB, heap=2.0GB]
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Data Regions Configured:
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=710.0 MiB, persistenceEnabled=false]
[12:59:25,309][INFO][exchange-worker-#38][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], crd=true, evt=NODE_JOINED, evtNode=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, customEvt=null, allowMerge=true]
[12:59:25,309][WARNING][disco-event-worker-#37][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.18.0.1, 172.31.29.3], sockAddrs=[/172.31.29.3:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.18.0.1:47500], discPort=47500, order=312, intOrder=157, lastExchangeTime=1529067545288, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:59:25,310][INFO][exchange-worker-#38][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], waitTime=0ms, futInfo=NA]
[12:59:25,310][INFO][exchange-worker-#38][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], crd=true]
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] Topology snapshot [ver=313, servers=1, clients=0, CPUs=2, offheap=0.69GB, heap=1.0GB]
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] Data Regions Configured:
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=710.0 MiB, persistenceEnabled=false]
[12:59:25,310][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=312, minorTopVer=0]]
[12:59:25,311][INFO][disco-event-worker-#37][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=312, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=313, minorTopVer=0], evt=NODE_FAILED, evtNode=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, evtNodeClient=false]
[12:59:25,311][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=313, minorTopVer=0]]
[12:59:25,311][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=313, minorTopVer=0], err=null]
[12:59:25,312][INFO][exchange-worker-#38][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=313, minorTopVer=0], evt=NODE_JOINED, node=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15]
[12:59:25,315][INFO][grid-timeout-worker-#23][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=225f750c, uptime=01:42:00.504]
^-- H/N/C [hosts=1, nodes=1, CPUs=2]
^-- CPU [cur=0.17%, avg=0.4%, GC=0%]
^-- PageMemory [pages=200]
^-- Heap [used=73MB, free=92.47%, comm=981MB]
^-- Non heap [used=53MB, free=96.47%, comm=55MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=6, qSize=0]
^-- System thread pool [active=0, idle=8, qSize=0]
[12:59:25,320][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/34.241.7.9, rmtPort=53627]
[12:59:25,320][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/34.241.7.9, rmtPort=53627]
[12:59:25,320][INFO][tcp-disco-sock-reader-#628][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/34.241.7.9:53627, rmtPort=53627]
[12:59:25,325][INFO][tcp-disco-sock-reader-#628][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/34.241.7.9:53627, rmtPort=53627
[12:59:30,332][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/34.241.7.9, rmtPort=50418]
[12:59:30,332][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/34.241.7.9, rmtPort=50418]
[12:59:30,332][INFO][tcp-disco-sock-reader-#629][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/34.241.7.9:50418, rmtPort=50418]
[12:59:30,334][INFO][tcp-disco-sock-reader-#629][TcpDiscoverySpi] Finished
Logs from the second Ignite node:
[12:13:12,850][INFO][main][TcpCommunicationSpi] Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
[12:13:12,869][WARNING][main][TcpCommunicationSpi] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[12:13:12,888][WARNING][main][NoopCheckpointSpi] Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[12:13:12,918][WARNING][main][GridCollisionManager] Collision resolution is disabled (all jobs will be activated upon arrival).
[12:13:12,919][INFO][main][IgniteKernal] Security status [authentication=off, tls/ssl=off]
[12:13:13,275][INFO][main][ClientListenerProcessor] Client connector processor has started on TCP port 10800
[12:13:13,328][INFO][main][GridTcpRestProtocol] Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11211]
[12:13:13,369][INFO][main][IgniteKernal] Non-loopback local IPs: 172.17.0.1, 172.18.0.1, 172.31.29.3, fe80:0:0:0:10f0:92ff:fea1:d09f%vethee2519f, fe80:0:0:0:42:19ff:fe73:ee80%docker_gwbridge, fe80:0:0:0:42:e6ff:fe14:144a%docker0, fe80:0:0:0:4b3:6ff:fe01:7ee0%eth0, fe80:0:0:0:64f4:8bff:fe83:7e97%vethdae9948, fe80:0:0:0:9474:a1ff:fe6b:3368%vethcb2500f
[12:13:13,370][INFO][main][IgniteKernal] Enabled local MACs: 02421973EE80, 0242E614144A, 06B306017EE0, 12F092A1D09F, 66F48B837E97, 9674A16B3368
[12:13:13,429][INFO][main][TcpDiscoverySpi] Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15]
[12:13:18,555][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:18:20,925][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:23:22,710][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:28:23,988][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:33:25,004][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:38:25,815][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:43:26,831][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:48:27,916][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
If you are starting two nodes from the same config file, try setting localPortRange on the TcpDiscoverySpi so that each node can bind its own discovery port.
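What a port range buys you can be illustrated without Ignite at all. The sketch below uses plain JDK sockets: each "node" tries the base port first and falls back to the next port in the range when the base port is already taken. It is only a conceptual illustration, not Ignite's implementation; the class name PortRangeDemo, the method bindFirstFree, and the range size of 10 are hypothetical.

```java
import java.io.IOException;
import java.net.ServerSocket;

final class PortRangeDemo {
    /** Binds to the first free port in [basePort, basePort + range), or throws. */
    public static ServerSocket bindFirstFree(int basePort, int range) throws IOException {
        IOException last = null;
        for (int port = basePort; port < basePort + range; port++) {
            try {
                return new ServerSocket(port);
            } catch (IOException e) {
                last = e; // port taken (or not bindable); try the next one
            }
        }
        throw new IOException("No free port in range starting at " + basePort, last);
    }

    public static void main(String[] args) throws IOException {
        // Two "nodes" started with the same settings on the same host.
        try (ServerSocket first = bindFirstFree(47500, 10);
             ServerSocket second = bindFirstFree(47500, 10)) {
            System.out.println("first=" + first.getLocalPort()
                    + " second=" + second.getLocalPort());
        }
    }
}
```

With a range of one port, the second call would fail instead of falling back, which is the situation a port range is meant to avoid.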

Connect to Ignite server with public and private ip

I tried to connect my Ignite client A (running in the Eclipse IDE) to a remote Ignite server B running in a different network (an OpenStack VM). B has a public ("floating") IP such as 193.224.x.x and a private IP, 192.168.0.4, which is not visible from A.
On A, I set the public IP of B as the address to connect to in Java (roughly: IgniteConfiguration < TcpDiscoverySpi.setIpFinder < TcpDiscoveryVmIpFinder < setAddresses(Arrays.asList("193.224.x.x"))). Port 47500 (and some other Ignite ports) is open on B to everyone.
When I start the client, I get an exception after a while:
SEVERE: Failed to reinitialize local partitions (preloading will be stopped): GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=4a4a9c63-b3e6-4191-a966-6fe86071c7d5, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.1.100], sockAddrs=[/192.168.1.100:0, /0:0:0:0:0:0:0:1:0, /127.0.0.1:0], discPort=0, order=6, intOrder=0, lastExchangeTime=1530529560836, loc=true, ver=2.5.0#20180523-sha1:86e110c7, isClient=true], topVer=6, nodeId8=4a4a9c63, msg=null, type=NODE_JOINED, tstamp=1530529560973], nodeId=4a4a9c63, evt=NODE_JOINED]
class org.apache.ignite.IgniteCheckedException: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=d5828cee-0bbb-45e8-ba55-c34c1e68f165, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, 0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1530529560939, loc=false, ver=2.5.0#20180523-sha1:86e110c7, isClient=false], topic=TOPIC_CACHE, msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null, partSizes=null, partHistCntrs=null, err=null, client=true, compress=true, finishMsg=null, super=GridDhtPartitionsAbstractMessage [exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=4a4a9c63-b3e6-4191-a966-6fe86071c7d5, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.1.100], sockAddrs=[/192.168.1.100:0, /0:0:0:0:0:0:0:1:0, /127.0.0.1:0], discPort=0, order=6, intOrder=0, lastExchangeTime=1530529560836, loc=true, ver=2.5.0#20180523-sha1:86e110c7, isClient=true], topVer=6, nodeId8=4a4a9c63, msg=null, type=NODE_JOINED, tstamp=1530529560973], nodeId=4a4a9c63, evt=NODE_JOINED], lastVer=GridCacheVersion [topVer=0, order=1530529560661, nodeOrder=0], super=GridCacheMessage [msgId=1, depInfo=null, err=null, skipPrepare=false]]], policy=2]
I see signs that the client actually connects to the server for a moment (Topology snapshot [ver=6, servers=1, clients=1, CPUs=8,), but after that it is disconnected (or something else happens). From the exception, it seems that the client tries to connect to sockAddrs=[/192.168.0.4:47500..., which fails, instead of 193.224.x.x:47500.
I tried the settings I could find for letting B know its external IP in its config file, but none of them worked:
<property name="addressResolver">
    <bean class="org.apache.ignite.configuration.BasicAddressResolver">
        <constructor-arg>
            <map>
                <entry key="192.168.0.4" value="193.224.x.x"/>
            </map>
        </constructor-arg>
    </bean>
</property>
nor
<bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
    <property name="localAddress" value="193.224.x.x"/>
</bean>
nor
<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="localAddress" value="193.224.x.x"/>
    </bean>
</property>
I have no more ideas on how to fix this. The Ignite docs are very brief regarding this kind of clustering configuration.
It looks like discovery works for you but communication fails.
You can try supplying your own TcpCommunicationSpi to IgniteConfiguration on the server node, with localAddress set to 193.224.x.x. However, this will likely cause all node-to-node traffic to travel over the external network.
Alternatively, leave the configuration on B intact and set localAddress to 193.224.x.x (or other external address) on node A, to make sure it doesn't bind to its own 192.168.x.x address, which isn't shared with B.
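Before changing the configuration, it may help to confirm which ports are actually reachable from A. The following is a minimal sketch using only JDK sockets; it assumes the server still uses Ignite's default ports (47500 for discovery, 47100 for communication), and the class name PortProbe and the 2-second timeout are hypothetical choices.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    public static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the server's public address as the first argument.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // Assumed defaults: 47500 = discovery, 47100 = communication.
        int[] ports = {47500, 47100};
        for (int port : ports) {
            System.out.println(host + ":" + port
                    + " reachable=" + isReachable(host, port, 2000));
        }
    }
}
```

If 47500 is reachable but 47100 is not, that would be consistent with discovery succeeding and communication failing. Note that this only checks the direction from A to B; the server may also need to open connections back to the client.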

Apache ignite node not able to join grid

I'm using a static IP finder configuration and have installed two Ignite Docker containers on two different EC2 instances,
but the nodes are not able to join each other. The logs are below:
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=46, servers=2, clients=0, CPUs=6, offheap=3.8GB, heap=2.0GB]
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] Data Regions Configured:
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=3.1 GiB, persistenceEnabled=false]
[07:40:10,697][INFO][exchange-worker-#42][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], crd=true, evt=NODE_JOINED, evtNode=05bece82-1950-4fc0-a58e-c062ad4e9b18, customEvt=null, allowMerge=true]
[07:40:10,697][INFO][exchange-worker-#42][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], waitTime=0ms, futInfo=NA]
[07:40:10,697][INFO][exchange-worker-#42][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], crd=true]
[07:40:10,697][WARNING][disco-event-worker-#41][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=05bece82-1950-4fc0-a58e-c062ad4e9b18, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.19.0.1, 192.168.1.202], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.19.0.1:47500, /192.168.1.202:47500], discPort=47500, order=46, intOrder=24, lastExchangeTime=1529048390669, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=47, servers=1, clients=0, CPUs=4, offheap=3.1GB, heap=1.0GB]
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] Data Regions Configured:
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=3.1 GiB, persistenceEnabled=false]
[07:40:10,699][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=46, minorTopVer=0]]
[07:40:10,699][INFO][disco-event-worker-#41][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=46, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=47, minorTopVer=0], evt=NODE_FAILED, evtNode=05bece82-1950-4fc0-a58e-c062ad4e9b18, evtNodeClient=false]
[07:40:10,699][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0]]
[07:40:10,700][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0], err=null]
[07:40:10,701][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/53.247.167.223, rmtPort=50787]
[07:40:10,701][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/53.247.167.223, rmtPort=50787]
[07:40:10,701][INFO][tcp-disco-sock-reader-#133][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/53.247.167.223:50787, rmtPort=50787]
[07:40:10,702][INFO][exchange-worker-#42][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=47, minorTopVer=0], evt=NODE_JOINED, node=05bece82-1950-4fc0-a58e-c062ad4e9b18]
[07:40:10,704][INFO][tcp-disco-sock-reader-#133][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/53.247.167.223:50787, rmtPort=50787
You can pass the first container's host name to the Ignite node in the second container through a system environment variable referenced in your Ignite configuration:
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
    <property name="addresses">
        <list>
            <value>#{systemEnvironment['IGNITE_HOST'] ?: '127.0.0.1'}:47500..47509</value>
        </list>
    </property>
</bean>
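To make the address expression in this bean easier to read, here is a small plain-Java sketch of what it evaluates to: the host comes from IGNITE_HOST with a fallback to 127.0.0.1, and the 47500..47509 suffix stands for a range of ports. This is not Ignite's or Spring's own parsing code; the class DiscoveryAddresses and its methods are hypothetical, the range is assumed to be inclusive at both ends, and IPv6 literals are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class DiscoveryAddresses {
    /** Mirrors the "?:" fallback: use IGNITE_HOST if set and non-empty, else 127.0.0.1. */
    public static String resolveHost(Map<String, String> env) {
        String host = env.get("IGNITE_HOST");
        return (host != null && !host.isEmpty()) ? host : "127.0.0.1";
    }

    /** Expands "host:first..last" into one "host:port" entry per port. */
    public static List<String> expand(String address) {
        int colon = address.lastIndexOf(':');
        String host = address.substring(0, colon);
        String ports = address.substring(colon + 1);
        int dots = ports.indexOf("..");
        int first = Integer.parseInt(dots < 0 ? ports : ports.substring(0, dots));
        int last = dots < 0 ? first : Integer.parseInt(ports.substring(dots + 2));
        List<String> result = new ArrayList<>();
        for (int port = first; port <= last; port++) {
            result.add(host + ":" + port);
        }
        return result;
    }

    public static void main(String[] args) {
        String host = resolveHost(System.getenv());
        // Prints the ten candidate discovery addresses, one per line.
        expand(host + ":47500..47509").forEach(System.out::println);
    }
}
```

Running it with IGNITE_HOST unset lists the addresses on 127.0.0.1; with IGNITE_HOST set to a container's host name, the same ports are listed for that host.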
An example docker-compose.yml for two Ignite services that communicate with each other:
version: "3"
services:
  ignite:
    image: image_name1
    networks:
      - net
  face:
    image: image_name2
    depends_on:
      - ignite
    networks:
      - net
    environment:
      IGNITE_HOST: 'ignite'
networks:
  net:
The Ignite node in 'face' can then connect to the Ignite node in 'ignite' using the address ignite:47500..47509.
Try using internal IP addresses, as suggested in this answer: http://apache-ignite-users.70518.x6.nabble.com/Ignite-docker-container-not-able-to-join-in-cluster-td22080.html