I can successfully join and leave a single-node Apache Ignite 2.8.1 topology running as a Docker container on my local Docker server.
Running the exact same program against a remote Docker server, I can see my program joining the cluster topology, but before the connection completes I get the following connection error:
SEVERE: Failed to send message to remote node [node=TcpDiscoveryNode [id=a239f009-bddd-4a06-845f-abb304850849, consistentId=127.0.0.1,172.17.0.13:42002, addrs=ArrayList [127.0.0.1, 172.17.0.13], sockAddrs=HashSet [/172.17.0.13:42002, /127.0.0.1:42002], discPort=42002, order=1, intOrder=1, lastExchangeTime=1605015503009, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null, partsSizes=null, partHistCntrs=null, err=null, client=true, exchangeStartTime=106333448635300, finishMsg=null, super=GridDhtPartitionsAbstractMessage [exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=dc9a3700-5377-4095-ac2b-31a2cea3d9a5, consistentId=dc9a3700-5377-4095-ac2b-31a2cea3d9a5, addrs=ArrayList [0:0:0:0:0:0:0:1, 10.91.7.30, 127.0.0.1, 192.168.1.81, 192.168.38.1], sockAddrs=HashSet [host.docker.internal/192.168.1.81:0, /0:0:0:0:0:0:0:1:0, GBLG7Y7GH2.mshome.net/192.168.38.1:0, /127.0.0.1:0, GBLG7Y7GH2.enterprisenet.org/10.91.7.30:0], discPort=0, order=2, intOrder=0, lastExchangeTime=1605015498538, loc=true, ver=2.8.1#20200521-sha1:86422096, isClient=true], topVer=2, nodeId8=dc9a3700, msg=null, type=NODE_JOINED, tstamp=1605015505481], nodeId=dc9a3700, evt=NODE_JOINED], lastVer=GridCacheVersion [topVer=0, order=1605015496511, nodeOrder=0], super=GridCacheMessage [msgId=1, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], err=null, skipPrepare=false]]]]]
class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a239f009-bddd-4a06-845f-abb304850849, addrs=[/172.17.0.13:42003, /127.0.0.1:42003]]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3738)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3458)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3198)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3078)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2918)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2877)
at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:2035)
at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:2132)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.sendLocalPartitions(GridDhtPartitionsExchangeFuture.java:2020)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.clientOnlyExchange(GridDhtPartitionsExchangeFuture.java:1436)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:903)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3214)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3063)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a239f009-bddd-4a06-845f-abb304850849, addrs=[/172.17.0.13:42003, /127.0.0.1:42003]]
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3740)
... 15 more
Caused by: java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:129)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3584)
... 15 more
Caused by: java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:129)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3584)
... 15 more
In my view the problem relates to the client connection settings, so I tried increasing the client discovery SPI's "joinTimeout", "networkTimeout", and "socketTimeout" settings, as well as the communication SPI's "connectTimeout" and "socketWriteTimeout" settings, but without success.
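For reference, this is roughly where those settings live in the Spring XML (a minimal sketch; the timeout values are illustrative, not the ones I actually used):
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- Give the joining client more time before it gives up. -->
            <property name="joinTimeout" value="60000"/>
            <property name="networkTimeout" value="10000"/>
            <property name="socketTimeout" value="10000"/>
        </bean>
    </property>
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <!-- Timeouts for the communication (data) connections. -->
            <property name="connectTimeout" value="10000"/>
            <property name="socketWriteTimeout" value="10000"/>
        </bean>
    </property>
</bean>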
You have to set up an AddressResolver for the node running inside the remote Docker container.
Have a look at: https://www.gridgain.com/docs/latest/installation-guide/aws/manual-install-on-ec2#connecting-a-client-node
If you're using Spring configuration, then your config should look something like this:
<property name="addressResolver">
<bean class="org.apache.ignite.configuration.BasicAddressResolver">
<constructor-arg>
<map>
<entry key="172.31.59.27" value="3.93.186.198"/>
</map>
</constructor-arg>
</bean>
</property>
<!-- other properties -->
<!-- Discovery configuration -->
</bean>
Here 172.31.59.27 is the internal IP and 3.93.186.198 is the external IP that you connect to.
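In the Docker scenario from the question, the same pattern would map the container's internal address from the logs (172.17.0.13) to whatever address the client can actually reach. A sketch, where docker-host-ip is a placeholder, not a real value:
<property name="addressResolver">
    <bean class="org.apache.ignite.configuration.BasicAddressResolver">
        <constructor-arg>
            <map>
                <!-- Placeholder: replace with the IP the client uses to reach the Docker host. -->
                <entry key="172.17.0.13" value="docker-host-ip"/>
            </map>
        </constructor-arg>
    </bean>
</property>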
Did you open ports 47500 and 47100 both ways between your Docker host and the remote node?
Failed to unmarshal discovery data for component: 1
class org.apache.ignite.IgniteCheckedException: Failed to deserialize object with given class loader: TomEEWebappClassLoader
context: chinawork
delegate: false
[16:27:05] ver. 2.7.0#20181201-sha1:256ae401
[16:27:05] 2018 Copyright(C) Apache Software Foundation
[16:27:05]
[16:27:05] Ignite documentation: http://ignite.apache.org
[16:27:05]
[16:27:05] Quiet mode.
[16:27:05] ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[16:27:05] ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[16:27:05]
[16:27:05] OS: Windows 10 10.0 amd64
[16:27:05] VM information: Java(TM) SE Runtime Environment 1.8.0_152-b16 Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.152-b16
[16:27:05] Please set system property '-Djava.net.preferIPv4Stack=true' to avoid possible problems in mixed environments.
[16:27:05] Initial heap size is 126MB (should be no less than 512MB, use -Xms512m -Xmx512m).
[16:27:05] Configured plugins:
[16:27:05] ^-- None
[16:27:05]
[16:27:05] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]]]
[16:27:06] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[16:27:06] Security status [authentication=off, tls/ssl=off]
[16:27:07] REST protocols do not start on client node. To start the protocols on client node set '-DIGNITE_REST_START_ON_CLIENT=true' system property.
Dec 11, 2018 4:27:13 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to unmarshal discovery data for component: 1
class org.apache.ignite.IgniteCheckedException: Failed to deserialize object with given class loader: TomEEWebappClassLoader
context: cnf-soa
delegate: false
----------> Parent Classloader:
java.net.URLClassLoader#f6f4d33
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:147)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:161)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:82)
at org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket.unmarshalData(DiscoveryDataPacket.java:280)
at org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket.unmarshalGridData(DiscoveryDataPacket.java:123)
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:2006)
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeAddFinishedMessage(ClientImpl.java:2181)
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:2060)
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1905)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:304)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.io.InvalidClassException: javax.cache.configuration.MutableConfiguration; local class incompatible: stream classdesc serialVersionUID = 201306200821, local class serialVersionUID = 201405
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:687)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1880)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1880)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2037)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2282)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2206)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2064)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at java.util.HashMap.readObject(HashMap.java:1409)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1158)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2173)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2064)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2282)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2206)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2064)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:139)
... 12 more
[16:27:14] Performance suggestions for grid 'igniteCosco' (fix if possible)
[16:27:14] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[16:27:14] ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
[16:27:14] ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]' to JVM options)
[16:27:14] ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[16:27:14] ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[16:27:14] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[16:27:14]
[16:27:14] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[16:27:14]
[16:27:14] Ignite node started OK (id=9d93bb08, instance name=igniteCosco)
[16:27:14] Topology snapshot [ver=2, locNode=9d93bb08, servers=1, clients=1, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=7.1GB]
Dec 11, 2018 4:27:15 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: TcpDiscoveryClientMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=1ac606c9761-9d93bb08-2ba3-4234-807b-941605b3597b, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
java.net.SocketException: Socket is closed
at java.net.Socket.getSendBufferSize(Socket.java:1215)
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.socketStream(TcpDiscoverySpi.java:1480)
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1606)
at org.apache.ignite.spi.discovery.tcp.ClientImpl$SocketWriter.body(ClientImpl.java:1362)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Dec 11, 2018 4:27:25 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to reconnect to cluster (consider increasing 'networkTimeout' configuration property) [networkTimeout=5000]
2018-12-11 16:27:25.768 [localhost-startStop-1] ERROR cjf.web.CommonServlet - Failed to load the initialization resource file [/cjf/config/cjfinit.properties].
javax.cache.CacheException: class org.apache.ignite.IgniteClientDisconnectedException: Failed to execute dynamic cache change request, client node disconnected.
at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1337)
at org.apache.ignite.internal.IgniteKernal.getOrCreateCache(IgniteKernal.java:3310)
at cjf.init.InitIgniteCache.intercept(InitIgniteCache.java:148)
at cjf.common.responsibility.DefaultActionInvocation.invoke(DefaultActionInvocation.java:26)
at cjf.init.CjfClusterInterceptor.intercept(CjfClusterInterceptor.java:37)
at cjf.common.responsibility.DefaultActionInvocation.invoke(DefaultActionInvocation.java:26)
at cjf.init.CjfMailInterceptor.intercept(CjfMailInterceptor.java:34)
at cjf.common.responsibility.DefaultActionInvocation.invoke(DefaultActionInvocation.java:26)
at cjf.init.InitSsoInterceptor.intercept(InitSsoInterceptor.java:52)
at cjf.common.responsibility.DefaultActionInvocation.invoke(DefaultActionInvocation.java:26)
at cjf.init.InitServletInterceptor.intercept(InitServletInterceptor.java:33)
at cjf.common.responsibility.DefaultActionInvocation.invoke(DefaultActionInvocation.java:26)
at cjf.init.InitCjfInterceptor.intercept(InitCjfInterceptor.java:50)
at cjf.common.responsibility.DefaultActionInvocation.invoke(DefaultActionInvocation.java:26)
at cjf.init.SysCacheInterceptor.intercept(SysCacheInterceptor.java:129)
at cjf.common.responsibility.DefaultActionInvocation.invoke(DefaultActionInvocation.java:26)
at cjf.web.CommonServlet.initCaches(CommonServlet.java:111)
at cjf.web.CommonServlet.init(CommonServlet.java:58)
at javax.servlet.GenericServlet.init(GenericServlet.java:158)
at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1144)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1091)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:983)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4978)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5290)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:754)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:730)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1140)
at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1875)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ignite.IgniteClientDisconnectedException: Failed to execute dynamic cache change request, client node disconnected.
at org.apache.ignite.internal.util.IgniteUtils$15.apply(IgniteUtils.java:948)
at org.apache.ignite.internal.util.IgniteUtils$15.apply(IgniteUtils.java:944)
... 35 common frames omitted
Caused by: org.apache.ignite.internal.IgniteClientDisconnectedCheckedException: Failed to execute dynamic cache change request, client node disconnected.
at org.apache.ignite.internal.processors.cache.GridCacheProcessor.onDisconnected(GridCacheProcessor.java:1173)
at org.apache.ignite.internal.IgniteKernal.onDisconnected(IgniteKernal.java:3949)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:821)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.lambda$onDiscovery$0(GridDiscoveryManager.java:604)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body0(GridDiscoveryManager.java:2667)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body(GridDiscoveryManager.java:2705)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
... 1 common frames omitted
IgniteConfiguration:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="clientMode" value="true"/>
<property name="igniteInstanceName" value="igniteTest"/>
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
<property name="addresses">
<list>
<value>127.0.0.1:47500..47510</value>
</list>
</property>
</bean>
</property>
</bean>
</property>
</bean>
TomEE's lib/javaee-api-7.0-1.jar library contains the javax.cache (JCache) API version 1.1, while Ignite depends on javax.cache 1.0.
You need to eliminate this dependency conflict. It makes sense to exclude javax.cache by setting openejb.classloader.forced-skip=javax.cache in system.properties.
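Assuming a standard TomEE layout (the file location below is the usual default, not something stated in the question), the property goes into conf/system.properties:
# <TomEE home>/conf/system.properties
openejb.classloader.forced-skip=javax.cache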
It looks like you have put some type into the discovery data that is not present on the other nodes.
I can see that you have "local class incompatible". Is it possible that you have javax.cache 1.0 on one node but javax.cache 1.1 on another? That could cause the problem you are observing.
I'm using the apacheignite:2.5.0 Docker image deployed on two different EC2 instances with the static IP finder configuration shown below, and one of the nodes is unable to join the cluster. I have attached the logs as well; they show the node accepting a connection and then disconnecting.
I ran the Docker containers with --net=host, so each container attaches all its ports to the host machine, and all ports are open in the security group.
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:util="http://www.springframework.org/schema/util"
xsi:schemaLocation="
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/util
http://www.springframework.org/schema/util/spring-util.xsd">
<bean abstract="false" id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
<property name="addresses">
<list>
<value>34.241.10.9:47500</value>
</list>
</property>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.18.0.1, 172.31.29.3], sockAddrs=[/172.31.29.3:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.18.0.1:47500], discPort=47500, order=312, intOrder=157, lastExchangeTime=1529067545288, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Topology snapshot [ver=312, servers=2, clients=0, CPUs=6, offheap=3.8GB, heap=2.0GB]
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] Data Regions Configured:
[12:59:25,309][INFO][disco-event-worker-#37][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=710.0 MiB, persistenceEnabled=false]
[12:59:25,309][INFO][exchange-worker-#38][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], crd=true, evt=NODE_JOINED, evtNode=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, customEvt=null, allowMerge=true]
[12:59:25,309][WARNING][disco-event-worker-#37][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.18.0.1, 172.31.29.3], sockAddrs=[/172.31.29.3:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.18.0.1:47500], discPort=47500, order=312, intOrder=157, lastExchangeTime=1529067545288, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:59:25,310][INFO][exchange-worker-#38][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], waitTime=0ms, futInfo=NA]
[12:59:25,310][INFO][exchange-worker-#38][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], crd=true]
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] Topology snapshot [ver=313, servers=1, clients=0, CPUs=2, offheap=0.69GB, heap=1.0GB]
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] Data Regions Configured:
[12:59:25,310][INFO][disco-event-worker-#37][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=710.0 MiB, persistenceEnabled=false]
[12:59:25,310][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=312, minorTopVer=0]]
[12:59:25,311][INFO][disco-event-worker-#37][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=312, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=313, minorTopVer=0], evt=NODE_FAILED, evtNode=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15, evtNodeClient=false]
[12:59:25,311][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=313, minorTopVer=0]]
[12:59:25,311][INFO][disco-event-worker-#37][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=312, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=313, minorTopVer=0], err=null]
[12:59:25,312][INFO][exchange-worker-#38][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=313, minorTopVer=0], evt=NODE_JOINED, node=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15]
[12:59:25,315][INFO][grid-timeout-worker-#23][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=225f750c, uptime=01:42:00.504]
^-- H/N/C [hosts=1, nodes=1, CPUs=2]
^-- CPU [cur=0.17%, avg=0.4%, GC=0%]
^-- PageMemory [pages=200]
^-- Heap [used=73MB, free=92.47%, comm=981MB]
^-- Non heap [used=53MB, free=96.47%, comm=55MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=6, qSize=0]
^-- System thread pool [active=0, idle=8, qSize=0]
[12:59:25,320][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/34.241.7.9, rmtPort=53627]
[12:59:25,320][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/34.241.7.9, rmtPort=53627]
[12:59:25,320][INFO][tcp-disco-sock-reader-#628][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/34.241.7.9:53627, rmtPort=53627]
[12:59:25,325][INFO][tcp-disco-sock-reader-#628][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/34.241.7.9:53627, rmtPort=53627
[12:59:30,332][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/34.241.7.9, rmtPort=50418]
[12:59:30,332][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/34.241.7.9, rmtPort=50418]
[12:59:30,332][INFO][tcp-disco-sock-reader-#629][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/34.241.7.9:50418, rmtPort=50418]
[12:59:30,334][INFO][tcp-disco-sock-reader-#629][TcpDiscoverySpi] Finished
Second Ignite node's logs:
[12:13:12,850][INFO][main][TcpCommunicationSpi] Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
[12:13:12,869][WARNING][main][TcpCommunicationSpi] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[12:13:12,888][WARNING][main][NoopCheckpointSpi] Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[12:13:12,918][WARNING][main][GridCollisionManager] Collision resolution is disabled (all jobs will be activated upon arrival).
[12:13:12,919][INFO][main][IgniteKernal] Security status [authentication=off, tls/ssl=off]
[12:13:13,275][INFO][main][ClientListenerProcessor] Client connector processor has started on TCP port 10800
[12:13:13,328][INFO][main][GridTcpRestProtocol] Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11211]
[12:13:13,369][INFO][main][IgniteKernal] Non-loopback local IPs: 172.17.0.1, 172.18.0.1, 172.31.29.3, fe80:0:0:0:10f0:92ff:fea1:d09f%vethee2519f, fe80:0:0:0:42:19ff:fe73:ee80%docker_gwbridge, fe80:0:0:0:42:e6ff:fe14:144a%docker0, fe80:0:0:0:4b3:6ff:fe01:7ee0%eth0, fe80:0:0:0:64f4:8bff:fe83:7e97%vethdae9948, fe80:0:0:0:9474:a1ff:fe6b:3368%vethcb2500f
[12:13:13,370][INFO][main][IgniteKernal] Enabled local MACs: 02421973EE80, 0242E614144A, 06B306017EE0, 12F092A1D09F, 66F48B837E97, 9674A16B3368
[12:13:13,429][INFO][main][TcpDiscoverySpi] Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=07b55edb-cdb7-45eb-bfd6-36fe9c5f5f15]
[12:13:18,555][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:18:20,925][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:23:22,710][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:28:23,988][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:33:25,004][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:38:25,815][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:43:26,831][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
[12:48:27,916][WARNING][main][TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
If you are using the same config file to start both nodes, try setting localPortRange on the discovery SPI.
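A minimal sketch of that setting (the values are illustrative): with localPortRange=10, each node may bind any discovery port from 47500 to 47509 instead of both competing for a single port:
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
    <!-- First discovery port to try. -->
    <property name="localPort" value="47500"/>
    <!-- How many consecutive ports to try if localPort is busy. -->
    <property name="localPortRange" value="10"/>
</bean>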
I tried to connect my Ignite client A (running in the Eclipse IDE) to a remote Ignite server B running in a different network (an OpenStack VM). B has a public IP ("floating IP") like 193.224.x.x and a private IP, 192.168.0.4, which is not visible from A.
In A, I set B's public IP as the address to connect to in the Java config (IgniteConfiguration > TcpDiscoverySpi.setIpFinder > TcpDiscoveryVmIpFinder.setAddresses(Arrays.asList("193.224.x.x"))). Port 47500 (and some other Ignite ports) is open on B to everyone.
Then, when I start the client, I get an exception after a while:
SEVERE: Failed to reinitialize local partitions (preloading will be stopped): GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=4a4a9c63-b3e6-4191-a966-6fe86071c7d5, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.1.100], sockAddrs=[/192.168.1.100:0, /0:0:0:0:0:0:0:1:0, /127.0.0.1:0], discPort=0, order=6, intOrder=0, lastExchangeTime=1530529560836, loc=true, ver=2.5.0#20180523-sha1:86e110c7, isClient=true], topVer=6, nodeId8=4a4a9c63, msg=null, type=NODE_JOINED, tstamp=1530529560973], nodeId=4a4a9c63, evt=NODE_JOINED]
class org.apache.ignite.IgniteCheckedException: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=d5828cee-0bbb-45e8-ba55-c34c1e68f165, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, 0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1530529560939, loc=false, ver=2.5.0#20180523-sha1:86e110c7, isClient=false], topic=TOPIC_CACHE, msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null, partSizes=null, partHistCntrs=null, err=null, client=true, compress=true, finishMsg=null, super=GridDhtPartitionsAbstractMessage [exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=4a4a9c63-b3e6-4191-a966-6fe86071c7d5, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.1.100], sockAddrs=[/192.168.1.100:0, /0:0:0:0:0:0:0:1:0, /127.0.0.1:0], discPort=0, order=6, intOrder=0, lastExchangeTime=1530529560836, loc=true, ver=2.5.0#20180523-sha1:86e110c7, isClient=true], topVer=6, nodeId8=4a4a9c63, msg=null, type=NODE_JOINED, tstamp=1530529560973], nodeId=4a4a9c63, evt=NODE_JOINED], lastVer=GridCacheVersion [topVer=0, order=1530529560661, nodeOrder=0], super=GridCacheMessage [msgId=1, depInfo=null, err=null, skipPrepare=false]]], policy=2]
I see signs that the client actually connects to the server for a moment (Topology snapshot [ver=6, servers=1, clients=1, CPUs=8, ...]), but after that it gets disconnected (or something else happens). From the exception it seems (to me) that the client tries to connect to sockAddrs=[/192.168.0.4:47500...], which fails, instead of 193.224.x.x:47500.
I tried what I could find to let B know its external IP in the config file, but none of the following worked:
<property name="addressResolver">
<bean class="org.apache.ignite.configuration.BasicAddressResolver">
<constructor-arg>
<map>
<entry key="192.168.0.4" value="193.224.x.x">
nor
<bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
<property name="localAddress" value="193.224.x.x"/>
nor
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="localAddress" value="193.224.x.x"/>
I have no more ideas on how to fix this. The Ignite docs are very brief regarding this clustering configuration.
It looks like discovery works for you, but communication fails.
You can try supplying your own TcpCommunicationSpi to IgniteConfiguration, setting localAddress on it to 193.224.x.x on the server node. However, this will likely cause all node-to-node traffic to travel over the external network.
You can also try setting localAddress to 193.224.x.x (or another external address) on node A, to make sure it doesn't bind to its own 192.168.x.x address that isn't reachable from B, while leaving the configuration on B intact.
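A minimal sketch of the first suggestion, applied on the server node (193.224.x.x is the floating IP from the question):
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <!-- Bind/advertise communication on the external address. -->
            <property name="localAddress" value="193.224.x.x"/>
        </bean>
    </property>
</bean>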
I'm using a static IP finder configuration with two Ignite Docker containers installed on two different EC2 instances, but the nodes are not able to join each other. Below are the logs:
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=46, servers=2, clients=0, CPUs=6, offheap=3.8GB, heap=2.0GB]
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] Data Regions Configured:
[07:40:10,696][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=3.1 GiB, persistenceEnabled=false]
[07:40:10,697][INFO][exchange-worker-#42][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], crd=true, evt=NODE_JOINED, evtNode=05bece82-1950-4fc0-a58e-c062ad4e9b18, customEvt=null, allowMerge=true]
[07:40:10,697][INFO][exchange-worker-#42][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], waitTime=0ms, futInfo=NA]
[07:40:10,697][INFO][exchange-worker-#42][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], crd=true]
[07:40:10,697][WARNING][disco-event-worker-#41][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=05bece82-1950-4fc0-a58e-c062ad4e9b18, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.19.0.1, 192.168.1.202], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /172.19.0.1:47500, /192.168.1.202:47500], discPort=47500, order=46, intOrder=24, lastExchangeTime=1529048390669, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=47, servers=1, clients=0, CPUs=4, offheap=3.1GB, heap=1.0GB]
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] Data Regions Configured:
[07:40:10,698][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=3.1 GiB, persistenceEnabled=false]
[07:40:10,699][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=46, minorTopVer=0]]
[07:40:10,699][INFO][disco-event-worker-#41][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=46, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=47, minorTopVer=0], evt=NODE_FAILED, evtNode=05bece82-1950-4fc0-a58e-c062ad4e9b18, evtNodeClient=false]
[07:40:10,699][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0]]
[07:40:10,700][INFO][disco-event-worker-#41][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=46, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0], err=null]
[07:40:10,701][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/53.247.167.223, rmtPort=50787]
[07:40:10,701][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/53.247.167.223, rmtPort=50787]
[07:40:10,701][INFO][tcp-disco-sock-reader-#133][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/53.247.167.223:50787, rmtPort=50787]
[07:40:10,702][INFO][exchange-worker-#42][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=47, minorTopVer=0], evt=NODE_JOINED, node=05bece82-1950-4fc0-a58e-c062ad4e9b18]
[07:40:10,704][INFO][tcp-disco-sock-reader-#133][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/53.247.167.223:50787, rmtPort=50787
You can pass the first container's host name to the Ignite node in the second container via a system environment variable in your Ignite configuration:
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
<property name="addresses">
<list>
<value>#{systemEnvironment['IGNITE_HOST'] ?: '127.0.0.1'}:47500..47509</value>
</list>
</property>
</bean>
An example docker-compose.yml for two communicating Ignite services:
version: "3"
services:
ignite:
image: image_name1
networks:
- net
face:
image: image_name2
depends_on:
- ignite
networks:
- net
environment:
IGNITE_HOST: 'ignite'
The Ignite node in 'face' can then connect to the Ignite node in 'ignite' using the address ignite:47500..47509.
Try using internal IP addresses, as suggested in this answer: http://apache-ignite-users.70518.x6.nabble.com/Ignite-docker-container-not-able-to-join-in-cluster-td22080.html
I'm setting up an Apache Ignite cluster and have difficulties keeping the topology alive when more than two nodes, connected through a LAN switch, join.
Many warnings and problems are reported in the log, and I wonder what the correct steps are to start isolating the problem. Ping works fine in both directions, and after some 30 s or 1 min the connection works, but the nodes often lose each other again. Sometimes the third node trying to connect causes the whole cluster to fail.
[20:41:34,761][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[20:41:34,761][INFO][tcp-disco-sock-reader-#28][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.10.161:34361, rmtPort=34361
[20:41:34,762][WARNING][disco-event-worker-#161][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=dd44ea86-5302-47a0-b3c0-86acdcf7e771, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.10.162], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, node_2/192.168.10.162:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1524656494760, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,764][INFO][tcp-disco-sock-reader-#14][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.10.1:55641, rmtPort=55641
[20:41:34,766][WARNING][disco-event-worker-#161][GridDiscoveryManager] Stopping local node according to configured segmentation policy.
[20:41:34,767][WARNING][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=379eb246-e111-4510-a3f6-09554667d769, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.10.161], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.161:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1524656073909, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,768][INFO][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=6, servers=2, clients=0, CPUs=60, heap=2.0GB]
[20:41:34,770][WARNING][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1524656176508, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,770][INFO][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=7, servers=1, clients=0, CPUs=56, heap=1.0GB]
[20:41:34,771][INFO][Thread-3][GridTcpRestProtocol] Command protocol successfully stopped: TCP binary
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][Thread-3][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=5, minorTopVer=0]]
[20:41:34,774][INFO][Thread-3][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][disco-event-worker-#161][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=5, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=379eb246-e111-4510-a3f6-09554667d769, evtNodeClient=false]
[20:41:34,774][INFO][disco-event-worker-#161][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=5, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=dd64661b-0679-4a14-9440-d876e5c35bd5, evtNodeClient=false]
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0]]
[20:41:34,787][INFO][Thread-3][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache]
[20:41:34,803][INFO][Thread-3][IgniteKernal]
>>> +---------------------------------------------------------------------------------+
>>> Ignite ver. 2.3.0#20171028-sha1:8add7fd5b501b40658096cdde48af9e948aa8150 stopped OK
>>> +---------------------------------------------------------------------------------+
>>> Grid uptime: 00:07:08.412
[root#node_2 apache-ignite-fabric-2.3.0-bin]# packet_write_wait: Connection to 192.168.10.162 port 22: Broken pipe
On one of the other nodes something like this is shown after some time:
[22:45:54,026][SEVERE][grid-nio-worker-tcp-comm-6-#127][TcpCommunicationSpi] Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=6, bytesRcvd=1578, bytesSent=5266, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-6, igniteInstanceName=null, finished=false, hashCode=733187042, interrupted=false, runner=grid-nio-worker-tcp-comm-6-#127]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=4, resendCnt=0, rcvCnt=4, sentCnt=4, reserved=true, lastAck=4, nodeLeft=false, node=TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1524656494855, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=4, resendCnt=0, rcvCnt=4, sentCnt=4, reserved=true, lastAck=4, nodeLeft=false, node=TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1524656494855, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/192.168.10.161:47100, rmtAddr=/192.168.10.1:47884, createTime=1524656504308, closeTime=0, bytesSent=5266, bytesRcvd=1578, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1524663359458, lastSndTime=1524656672249, lastRcvTime=1524663359458, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser#32244b13, directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1233)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2272)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2048)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1717)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
[22:45:54,027][WARNING][grid-nio-worker-tcp-comm-6-#127][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Connection reset by peer]
[22:46:41,002][INFO][grid-timeout-worker-#119][IgniteKernal]
Any idea where I should start looking for the cause of the problem?
Thanks!
As suggested in the warning
[20:41:34,761][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
the reason is likely a network issue. Pings may work fine (although I'd check the failure rate over a long enough interval, say 10-15 minutes), but also try a long-running TCP connection (maybe via netcat or something similar).
Another possible reason is high load on the nodes. For example, if a node goes into a stop-the-world GC pause and is unable to respond for a long time, it may be kicked out of the cluster.
To make the cluster more tolerant of short-term network and responsiveness issues, try increasing the IgniteConfiguration.failureDetectionTimeout setting.
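For example (the value is illustrative; the default is 10 seconds):
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Tolerate up to 30 seconds of unresponsiveness before a node is dropped. -->
    <property name="failureDetectionTimeout" value="30000"/>
</bean>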