How to debug Redis when memory usage is over 20 GB? - redis

I am having a serious issue with Redis and cannot work out the root cause. I am running dockerised redis6-alpine, and the exact same configuration works well in some other setups. Any hints or guidance are appreciated! Here is a quick view of my configuration:
# Server
redis_version:6.0.14
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:e0ee1c530b6d150b
redis_mode:standalone
os:Linux 4.19.0-16-amd64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:10.2.1
process_id:1
run_id:79e32727ad7a6eb2a9427060ba44fa918c01de5a
tcp_port:6379
uptime_in_seconds:215118
uptime_in_days:2
hz:10
configured_hz:10
lru_clock:14634210
executable:/data/redis-server
config_file:
io_threads_active:0
# Clients
connected_clients:58
client_recent_max_input_buffer:131072
client_recent_max_output_buffer:786456
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:23814128424
used_memory_human:22.18G
used_memory_rss:25484832768
used_memory_rss_human:23.73G
used_memory_peak:26912783984
used_memory_peak_human:25.06G
used_memory_peak_perc:88.49%
used_memory_overhead:43319344
used_memory_startup:803160
used_memory_dataset:23770809080
used_memory_dataset_perc:99.82%
allocator_allocated:23814697320
allocator_active:25313169408
allocator_resident:25528991744
total_system_memory:135186935808
total_system_memory_human:125.90G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.06
allocator_frag_bytes:1498472088
allocator_rss_ratio:1.01
allocator_rss_bytes:215822336
rss_overhead_ratio:1.00
rss_overhead_bytes:-44158976
mem_fragmentation_ratio:1.07
mem_fragmentation_bytes:1670633120
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:1345144
mem_aof_buffer:8
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:193999866
rdb_bgsave_in_progress:0
rdb_last_save_time:1625031828
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:167
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:565936128
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:18283706329
aof_base_size:18280419536
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:5857404
total_commands_processed:518486668
instantaneous_ops_per_sec:1821
total_net_input_bytes:177162832735
total_net_output_bytes:8187427886085
instantaneous_input_kbps:432.05
instantaneous_output_kbps:53204.55
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:30957
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:21298
evicted_keys:0
keyspace_hits:372132606
keyspace_misses:10318752
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:593467
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:4
total_reads_processed:394752846
total_writes_processed:382030652
io_threaded_reads_processed:0
io_threaded_writes_processed:0
# Replication
role:master
connected_slaves:0
master_replid:2a7e6a090bdaa0474ee666c6c6b555e0c1c3942a
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:12596.074365
used_cpu_user:6090.006186
used_cpu_sys_children:164.531584
used_cpu_user_children:1229.917070
# Modules
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=134616,expires=96904,avg_ttl=777017159
db1:keys=348560,expires=332665,avg_ttl=63989490
redis appendonly.aof file size
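
One way to start narrowing this down is to turn the INFO output into a few derived numbers. The following is a minimal sketch, assuming the text of `INFO` has been saved as a string; the function names (`parse_info`, `keyspace_totals`, `summarize`) are hypothetical helpers, and `SAMPLE` only repeats a handful of lines from the output above.

```python
def parse_info(text):
    """Parse `INFO` output into a flat dict of field name -> string value."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def keyspace_totals(info):
    """Sum the `keys=` counters over all dbN entries of the Keyspace section."""
    total = 0
    for name, value in info.items():
        if name.startswith("db") and name[2:].isdigit():
            fields = dict(f.split("=", 1) for f in value.split(","))
            total += int(fields["keys"])
    return total


def summarize(info):
    """Derive a few rough indicators from the raw counters."""
    keys = keyspace_totals(info)
    dataset = int(info["used_memory_dataset"])
    return {
        "total_keys": keys,
        # Rough average only: a few very large keys can dominate it.
        "avg_bytes_per_key": dataset / keys if keys else 0.0,
        "rss_to_used_ratio": int(info["used_memory_rss"]) / int(info["used_memory"]),
    }


# Lines copied from the INFO output above.
SAMPLE = """\
# Memory
used_memory:23814128424
used_memory_rss:25484832768
used_memory_dataset:23770809080
# Keyspace
db0:keys=134616,expires=96904,avg_ttl=777017159
db1:keys=348560,expires=332665,avg_ttl=63989490
"""

summary = summarize(parse_info(SAMPLE))
print(summary)
```

On these numbers the ratio of RSS to used memory is close to 1, so fragmentation does not look like the main factor, while the dataset averages out to roughly 49,000 bytes per key across about 483,000 keys. That average is only an indicator; it does not say which keys are large.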

Related

RedisJSON and RediSearch: (error) elem.map is not a function

I get this error when performing an FT.SEARCH through redis-cli with both redis/redis-stack-server:latest and redislabs/redismod:latest, following the how-to here for creating an index, creating documents, and querying them: https://redis.io/docs/stack/search/indexing_json/
It also happens when I follow these steps:
> FT.CREATE myIdx on JSON PREFIX 1 entity: SCHEMA $.position.y AS y NUMERIC $.position.x AS x NUMERIC $.name AS name TEXT
> JSON.SET entity:1 $ '{"id":"entityA","name":"EntityAlpha","longName":"This is entity alpha","speed":9.66,"ownerId":"god","position":{"x":15,"y":15},"color":"red"}'
> JSON.SET entity:2 $ '{"id":"entityB","name":"EntityBeta","longName":"This is entity beta","speed":9.66,"ownerId":"god","position":{"x":20,"y":20},"color":"red"}'
> JSON.SET entity:3 $ '{"id":"entityC","name":"EntityCeta","longName":"This is entity ceta","speed":9.66,"ownerId":"god","position":{"x":15,"y":25},"color":"fire"}'
> FT.SEARCH myIdx "#name:(Entity*)"
(error) elem.map is not a function
> FT.SEARCH myIdx "#x:[0 200]"
(error) elem.map is not a function
> FT.SEARCH myIdx "#name:(EntityAlpha)"
(error) elem.map is not a function
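
For comparison, the RediSearch query syntax addresses a field with an `@` prefix (for example `@name:(Entity*)` or `@x:[0 200]`), whereas the queries above use `#`. Whether that difference has anything to do with the `elem.map` error is not established here. The sketch below only builds the command strings in the `@` form so they can be pasted into redis-cli; the helper names are hypothetical and nothing is sent to a server.

```python
def text_query(field, term):
    """Build a field-scoped text query such as @name:(Entity*)."""
    return f"@{field}:({term})"


def numeric_range_query(field, low, high):
    """Build a numeric range query such as @x:[0 200]."""
    return f"@{field}:[{low} {high}]"


def ft_search_command(index, query):
    """Format an FT.SEARCH line with the query quoted for redis-cli."""
    return f'FT.SEARCH {index} "{query}"'


# Index and field names are taken from the FT.CREATE call above.
for q in (
    text_query("name", "Entity*"),
    numeric_range_query("x", 0, 200),
    text_query("name", "EntityAlpha"),
):
    print(ft_search_command("myIdx", q))
```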
Here are the results of 'info':
> info
# Server
redis_version:6.2.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:9c335ca9779faba5
redis_mode:standalone
os:Linux 5.15.0-48-generic x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:10.2.1
process_id:1
process_supervised:no
run_id:4dad70a5a0bf25821f440ad397ed5d114e637fbf
tcp_port:6379
server_time_usec:1664744098230246
uptime_in_seconds:39
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:3799714
executable:/data/redis-server
config_file:
io_threads_active:0
# Clients
connected_clients:2
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:16
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:9735336
used_memory_human:9.28M
used_memory_rss:32952320
used_memory_rss_human:31.43M
used_memory_peak:9735336
used_memory_peak_human:9.28M
used_memory_peak_perc:100.00%
used_memory_overhead:9354584
used_memory_startup:9313304
used_memory_dataset:380752
used_memory_dataset_perc:90.22%
allocator_allocated:10321296
allocator_active:10809344
allocator_resident:13840384
total_system_memory:12321312768
total_system_memory_human:11.48G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.05
allocator_frag_bytes:488048
allocator_rss_ratio:1.28
allocator_rss_bytes:3031040
rss_overhead_ratio:2.38
rss_overhead_bytes:19111936
mem_fragmentation_ratio:3.40
mem_fragmentation_bytes:23259752
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:40984
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
# Persistence
loading:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1664744059
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
# Stats
total_connections_received:2
total_commands_processed:6
instantaneous_ops_per_sec:0
total_net_input_bytes:28
total_net_output_bytes:4712
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
keyspace_hits:20
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:0
dump_payload_sanitizations:0
total_reads_processed:2
total_writes_processed:1
io_threaded_reads_processed:0
io_threaded_writes_processed:0
# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:0f0fce86ffcbff2fc056b6d4e788242be12f2b2c
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:0.827450
used_cpu_user:0.459694
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
used_cpu_sys_main_thread:0.061016
used_cpu_user_main_thread:0.130168
# Modules
module:name=graph,ver=20815,api=1,filters=0,usedby=[],using=[ReJSON],options=[]
module:name=timeseries,ver=10616,api=1,filters=0,usedby=[],using=[],options=[handle-io-errors]
module:name=ReJSON,ver=20011,api=1,filters=0,usedby=[search|graph],using=[],options=[handle-io-errors]
module:name=ai,ver=10205,api=1,filters=0,usedby=[],using=[],options=[handle-io-errors]
module:name=rg,ver=10204,api=1,filters=1,usedby=[rg],using=[rg],options=[]
module:name=search,ver=999999,api=1,filters=0,usedby=[],using=[ReJSON],options=[handle-io-errors]
module:name=bf,ver=20209,api=1,filters=0,usedby=[],using=[],options=[]
# Errorstats
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=5,expires=0,avg_ttl=0
Please help.

Ignite issue when a node in the cluster becomes unstable: new nodes are unable to join the cluster and hang indefinitely

Hi, I am facing a critical issue with Ignite on our production servers. We have 2 instances with heap sizes of 8 GB each. Sometimes, due to a long GC pause or a network issue, one of our instances gets stopped. This causes AWS auto-scaling to kick in and bring another instance up. This is fine, but we have observed that in this state the grid becomes unstable and our new Ignite instances are never able to join the topology. They hang forever, causing new auto-scaled instances to be launched again and again. The workaround is to restart the other instances in the cluster, after which nodes are able to join again. But ideally, in a prod environment, this should happen automatically with auto-scaling.
I have also set a longer failure detection timeout, but that does not solve it completely and we still observe this sometimes.
The logs observed on the new instances that fail to come up are below. The Ignite version used is 2.4, and off-heap mode is used for partitioned caches. Our grid is set up using the TCP discovery service with an S3 bucket.
I have some transactional caches as well, which do locking based on tryLocks.
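
Dumps like the one that follows are hard to read by eye, so it can help to pull out a few fields mechanically. This is a minimal sketch, assuming the log text is available as a string; the function name `summarize_exchange_dump` and the regular expressions are my own, based only on the field layout visible in this dump (`remaining=[...]`, `readyVer=AffinityTopologyVersion [topVer=N, ...]`, `ExplicitLockSpan [...]`).

```python
import re

# Node IDs an exchange future is still waiting on.
REMAINING_RE = re.compile(r"remaining=\[([^\]]*)\]")
# Last topology version a node reports as ready.
READY_VER_RE = re.compile(r"readyVer=AffinityTopologyVersion \[topVer=(-?\d+)")
# Explicit locks listed inside an ExplicitLockReleaseFuture.
LOCK_SPAN_RE = re.compile(r"ExplicitLockSpan\s*\[")


def summarize_exchange_dump(text):
    """Extract waiting nodes, ready versions, and explicit lock span count."""
    remaining = set()
    for group in REMAINING_RE.findall(text):
        remaining.update(n.strip() for n in group.split(",") if n.strip())
    return {
        "remaining_nodes": sorted(remaining),
        "ready_versions": [int(v) for v in READY_VER_RE.findall(text)],
        "explicit_lock_spans": len(LOCK_SPAN_RE.findall(text)),
    }


# Shortened fragments in the same shape as the log lines in this post.
SAMPLE = (
    "state=SRV, evtLatch=0, remaining=[a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9], "
    "super=GridFutureAdapter [state=INIT]\n"
    "Partitions exchange info [readyVer=AffinityTopologyVersion "
    "[topVer=29, minorTopVer=0]]\n"
    "futures=[ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, "
    "minorTopVer=0]], ExplicitLockSpan [topVer=AffinityTopologyVersion "
    "[topVer=29, minorTopVer=0]]]\n"
)

print(summarize_exchange_dump(SAMPLE))
```

Run against the full log, a summary like this makes it easier to see which node each exchange is waiting on, which topology version the coordinator last completed, and how many explicit locks are still listed under the release future.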
evtLatch=0, remaining=[a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9], super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1272213534]]]
2018-07-18 16:34:10.534 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], node=7d5e83aa-736a-4190-8b64-7261db7382f6]. Dumping pending objects that might be the cause:
2018-07-18 16:34:20.534 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], node=7d5e83aa-736a-4190-8b64-7261db7382f6]. Dumping pending objects that might be the cause:
2018-07-18 16:34:20.534 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Ready affinity version: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
2018-07-18 16:34:20.535 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Last exchange future: GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931660255, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=32, nodeId8=7d5e83aa, msg=null, type=NODE_JOINED, tstamp=1531931329481], crd=TcpDiscoveryNode [id=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, addrs=[10.83.87.131, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-87-131.ec2.internal/10.83.87.131:47500], discPort=47500, order=26, intOrder=14, lastExchangeTime=1531931329258, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931660255, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=32, nodeId8=7d5e83aa, msg=null, type=NODE_JOINED, tstamp=1531931329481], nodeId=7d5e83aa, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=true, hash=247159314], init=true, lastVer=null, partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[ExplicitLockReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[]], TxReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[]], AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[]], DataStreamerReleaseFuture 
[topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[]]]], exchActions=ExchangeActions [startCaches=null, stopCaches=null, startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null], affChangeMsg=null, initTs=1531931329576, centralizedAff=false, changeGlobalStateE=null, done=false, state=SRV, evtLatch=0, remaining=[a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9], super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1272213534]]
2018-07-18 16:34:20.535 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.a.i.i.p.c.GridCachePartitionExchangeManager - First 10 pending exchange futures [total=0]
2018-07-18 16:34:20.535 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Last 10 exchange futures (total: 1):
2018-07-18 16:34:20.536 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - >>> GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931660255, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false], done=false]
2018-07-18 16:34:20.536 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Pending transactions:
2018-07-18 16:34:20.536 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Pending explicit locks:
2018-07-18 16:34:20.536 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Pending cache futures:
2018-07-18 16:34:20.536 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Pending atomic cache futures:
2018-07-18 16:34:20.536 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Pending data streamer futures:
2018-07-18 16:34:20.536 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Pending transaction deadlock detection futures:
2018-07-18 16:34:20.547 UTC [FDPS] [grid-nio-worker-tcp-comm-3-#28%fdps%] [INFO ] [,] o.apache.ignite.internal.diagnostic - Exchange future waiting for coordinator response [crd=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0]]
Remote node information:
General node info [id=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, client=false, discoTopVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], time=12:34:20.537]
Partitions exchange info [readyVer=AffinityTopologyVersion [topVer=29, minorTopVer=0]]
Last initialized exchange future: GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=ba6aba6c-7f5d-41bf-bfcc-5eefcad36b62, addrs=[10.83.85.122, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-85-122.ec2.internal/10.83.85.122:47500], discPort=47500, order=30, intOrder=16, lastExchangeTime=1531930705943, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=30, nodeId8=a450db0b, msg=Node joined: TcpDiscoveryNode [id=ba6aba6c-7f5d-41bf-bfcc-5eefcad36b62, addrs=[10.83.85.122, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-85-122.ec2.internal/10.83.85.122:47500], discPort=47500, order=30, intOrder=16, lastExchangeTime=1531930705943, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], type=NODE_JOINED, tstamp=1531930706210], crd=TcpDiscoveryNode [id=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, addrs=[10.83.87.131, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-87-131.ec2.internal/10.83.87.131:47500], discPort=47500, order=26, intOrder=14, lastExchangeTime=1531931660254, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=30, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=ba6aba6c-7f5d-41bf-bfcc-5eefcad36b62, addrs=[10.83.85.122, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-85-122.ec2.internal/10.83.85.122:47500], discPort=47500, order=30, intOrder=16, lastExchangeTime=1531930705943, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=30, nodeId8=a450db0b, msg=Node joined: TcpDiscoveryNode [id=ba6aba6c-7f5d-41bf-bfcc-5eefcad36b62, addrs=[10.83.85.122, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-85-122.ec2.internal/10.83.85.122:47500], discPort=47500, order=30, intOrder=16, lastExchangeTime=1531930705943, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], type=NODE_JOINED, tstamp=1531930706210], nodeId=ba6aba6c, evt=NODE_JOINED], added=true, 
initFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1921954756], init=false, lastVer=GridCacheVersion [topVer=0, order=1531930704443, nodeOrder=0], partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion [topVer=30, minorTopVer=0], futures=[ExplicitLockReleaseFuture [topVer=AffinityTopologyVersion [topVer=30, minorTopVer=0], futures=[ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935479, nodeOrder=26], threadId=39726, id=559000, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=221, val=49583853497448469294730566354366524577617095530402283666, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547787212113, nodeOrder=26], threadId=39741, id=603904, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=288, val=49583853499611641578988037213538229804531966271996035234, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935487, 
nodeOrder=26], threadId=39740, id=558993, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=133, val=49583853497448469294730566354417299462040910024459419794, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935323, nodeOrder=26], threadId=39728, id=558949, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=1023, val=49583853497448469294730566353278491339963927967496667282, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935470, nodeOrder=26], threadId=39951, id=559009, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=556, val=49583853497448469294730566354226289182541798339977937042, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate 
[nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935497, nodeOrder=26], threadId=39683, id=558982, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=373, val=49583853497448469294730566354541818821461216966893109394, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935339, nodeOrder=26], threadId=39682, id=558941, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=156, val=49583853497448469294730566353353444740780034976328450194, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935358, nodeOrder=26], threadId=39936, id=558921, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=59, val=49583853497448469294730566353578304943228356208982229138, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan 
[topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandida... and 48550 skipped ...ead=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935486, nodeOrder=26], threadId=39894, id=558992, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=488, val=49583853497448469294730566354434224423515514832905306258, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], ExplicitLockSpan [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], firstCand=GridCacheMvccCandidate [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, ver=GridCacheVersion [topVer=141782290, order=1547786935331, nodeOrder=26], threadId=39893, id=558948, topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], reentry=null, otherNodeId=null, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=570, val=49583853497448469294730566353289371672340459630069022866, hasValBytes=false], masks=local=1|owner=0|ready=0|reentry=0|used=0|tx=0|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]]]], TxReleaseFuture [topVer=AffinityTopologyVersion [topVer=30, minorTopVer=0], futures=[]], AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion [topVer=30, minorTopVer=0], futures=[]], DataStreamerReleaseFuture [topVer=AffinityTopologyVersion [topVer=30, minorTopVer=0], futures=[]]]], exchActions=null, affChangeMsg=null, initTs=1531930706210, centralizedAff=false, changeGlobalStateE=null, done=false, state=CRD, evtLatch=0, 
remaining=[ba6aba6c-7f5d-41bf-bfcc-5eefcad36b62], super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=325602672]]
Communication SPI statistics [rmtNode=7d5e83aa-736a-4190-8b64-7261db7382f6]
Communication SPI recovery descriptors:
[key=ConnectionKey [nodeId=7d5e83aa-736a-4190-8b64-7261db7382f6, idx=0, connCnt=0], msgsSent=5, msgsAckedByRmt=0, msgsRcvd=7, lastAcked=0, reserveCnt=1, descIdHash=1972345954]
Communication SPI clients:
[node=7d5e83aa-736a-4190-8b64-7261db7382f6, client=GridTcpNioCommunicationClient [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=3, bytesRcvd=5740, bytesSent=77322, bytesRcvd0=853, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-3, igniteInstanceName=fdps, finished=false, hashCode=2068348067, interrupted=false, runner=grid-nio-worker-tcp-comm-3-#28%fdps%]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=7, sentCnt=5, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931329178, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=true, connectCnt=0, queueLimit=262144, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=7, sentCnt=5, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931329178, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=true, connectCnt=0, queueLimit=262144, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/10.83.87.131:47100, rmtAddr=/10.83.89.183:34664, createTime=1531931330498, closeTime=0, bytesSent=77322, bytesRcvd=5740, bytesSent0=0, bytesRcvd0=853, sndSchedTime=1531931330498, lastSndTime=1531931500547, lastRcvTime=1531931660527, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter 
[parser=org.apache.ignite.internal.util.nio.GridDirectParser#665c2413, directMode=true], GridConnectionBytesVerifyFilter], accepted=true]], super=GridAbstractCommunicationClient [lastUsed=1531931330508, closed=false, connIdx=0]]]
NIO sessions statistics:
>> Selector info [idx=3, keysCnt=1, bytesRcvd=5740, bytesRcvd0=853, bytesSent=77322, bytesSent0=0]
Connection info [in=true, rmtAddr=/10.83.89.183:34664, locAddr=/10.83.87.131:47100, msgsSent=5, msgsAckedByRmt=0, descIdHash=1972345954, unackedMsgs=[IgniteDiagnosticMessage, IgniteDiagnosticMessage, IgniteDiagnosticMessage, IgniteDiagnosticMessage, IgniteDiagnosticMessage], msgsRcvd=7, lastAcked=0, descIdHash=1972345954, bytesRcvd=5740, bytesRcvd0=853, bytesSent=77322, bytesSent0=0, opQueueSize=0]
Exchange future: GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931329178, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=32, nodeId8=a450db0b, msg=Node joined: TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931329178, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], type=NODE_JOINED, tstamp=1531931329402], crd=null, exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931329178, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=32, nodeId8=a450db0b, msg=Node joined: TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931329178, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], type=NODE_JOINED, tstamp=1531931329402], nodeId=7d5e83aa, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=980776600], init=false, lastVer=GridCacheVersion [topVer=0, order=1531931327875, nodeOrder=0], partReleaseFut=null, exchActions=null, affChangeMsg=null, initTs=0, centralizedAff=false, changeGlobalStateE=null, done=false, state=null, evtLatch=0, remaining=[], 
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=2138568466]]
Local communication statistics:
Communication SPI statistics [rmtNode=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9]
Communication SPI recovery descriptors:
[key=ConnectionKey [nodeId=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, idx=0, connCnt=-1], msgsSent=7, msgsAckedByRmt=0, msgsRcvd=6, lastAcked=0, reserveCnt=1, descIdHash=1891649612]
Communication SPI clients:
[node=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, client=GridTcpNioCommunicationClient [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0, bytesRcvd=92833, bytesSent=5698, bytesRcvd0=15539, bytesSent0=853, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-0, igniteInstanceName=fdps, finished=false, hashCode=2040212682, interrupted=false, runner=grid-nio-worker-tcp-comm-0-#25%fdps%]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=6, sentCnt=7, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, addrs=[10.83.87.131, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-87-131.ec2.internal/10.83.87.131:47500], discPort=47500, order=26, intOrder=14, lastExchangeTime=1531931329258, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false, connectCnt=1, queueLimit=262144, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=6, sentCnt=7, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, addrs=[10.83.87.131, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-87-131.ec2.internal/10.83.87.131:47500], discPort=47500, order=26, intOrder=14, lastExchangeTime=1531931329258, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false, connectCnt=1, queueLimit=262144, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/10.83.89.183:34664, rmtAddr=ip-10-83-87-131.ec2.internal/10.83.87.131:47100, createTime=1531931330468, closeTime=0, bytesSent=5698, bytesRcvd=92833, bytesSent0=853, bytesRcvd0=15539, sndSchedTime=1531931330468, lastSndTime=1531931660528, lastRcvTime=1531931660538, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter 
[parser=org.apache.ignite.internal.util.nio.GridDirectParser#72024a61, directMode=true], GridConnectionBytesVerifyFilter], accepted=false]], super=GridAbstractCommunicationClient [lastUsed=1531931330468, closed=false, connIdx=0]]]
NIO sessions statistics:
>> Selector info [idx=0, keysCnt=1, bytesRcvd=92833, bytesRcvd0=15539, bytesSent=5698, bytesSent0=853]
Connection info [in=false, rmtAddr=ip-10-83-87-131.ec2.internal/10.83.87.131:47100, locAddr=/10.83.89.183:34664, msgsSent=7, msgsAckedByRmt=0, descIdHash=1891649612, unackedMsgs=[GridDhtPartitionsSingleMessage, IgniteDiagnosticMessage, IgniteDiagnosticMessage, IgniteDiagnosticMessage, IgniteDiagnosticMessage], msgsRcvd=6, lastAcked=0, descIdHash=1891649612, bytesRcvd=92833, bytesRcvd0=15539, bytesSent=5698, bytesSent0=853, opQueueSize=0]
2018-07-18 16:34:29.598 UTC [FDPS] [localhost-startStop-1] [WARN ] [,] o.a.i.i.p.c.GridCachePartitionExchangeManager - Still waiting for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931669507, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=32, nodeId8=7d5e83aa, msg=null, type=NODE_JOINED, tstamp=1531931329481], crd=TcpDiscoveryNode [id=a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9, addrs=[10.83.87.131, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-87-131.ec2.internal/10.83.87.131:47500], discPort=47500, order=26, intOrder=14, lastExchangeTime=1531931329258, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=7d5e83aa-736a-4190-8b64-7261db7382f6, addrs=[10.83.89.183, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, ip-10-83-89-183.ec2.internal/10.83.89.183:47500], discPort=47500, order=32, intOrder=17, lastExchangeTime=1531931669507, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false], topVer=32, nodeId8=7d5e83aa, msg=null, type=NODE_JOINED, tstamp=1531931329481], nodeId=7d5e83aa, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=true, hash=247159314], init=true, lastVer=null, partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[ExplicitLockReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[]], TxReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[]], AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], 
futures=[]], DataStreamerReleaseFuture [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], futures=[]]]], exchActions=ExchangeActions [startCaches=null, stopCaches=null, startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null], affChangeMsg=null, initTs=1531931329576, centralizedAff=false, changeGlobalStateE=null, done=false, state=SRV, evtLatch=0, remaining=[a450db0b-ce86-4f0b-a34b-a2f9c83bb3d9], super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1272213534]]]
2018-07-18 16:34:30.537 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], node=7d5e83aa-736a-4190-8b64-7261db7382f6]. Dumping pending objects that might be the cause:
2018-07-18 16:34:40.537 UTC [FDPS] [exchange-worker-#35%fdps%] [WARN ] [,] o.apache.ignite.internal.diagnostic - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=32, minorTopVer=0], node=7d5e83aa-736a-4190-8b64-7261db7382f6]. Dumping pending objects that might be the cause:
Info about the other node 10-83-85-122
The other joining node never started; it was stuck in the Ignite start phase. Its logs show neither the node coming up nor IP discovery kicking in, and the node was eventually removed by autoscaling.
Transactional errors received
javax.cache.CacheException: Failed to acquire lock for keys (primary node left grid, retry transaction if possible) [keys=[UserKeyCacheObjectImpl [part=281,
Partition map exchange is the process by which nodes exchange information about where each piece of data is stored. It happens every time the topology changes.
Every node sends a GridDhtPartitionsSingleMessage to the coordinator. Once the coordinator has collected all such messages, it sends a GridDhtPartitionsFullMessage back to the other nodes. These messages are sent over the communication SPI.
If a non-coordinator node doesn't send its SingleMessage to the coordinator, or if the coordinator doesn't send the FullMessage, the "Failed to wait for partition map exchange" error occurs.
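The collect-then-broadcast step described above can be pictured with a small toy model. This is only a sketch of the idea, not Ignite's actual implementation: the class `ExchangeSketch`, the `Coordinator` type and the node IDs are hypothetical names made up for illustration. The point is that the coordinator cannot send the full message while even one node is still in its "remaining" set, which is what the `remaining=[...]` field in the log dump reflects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ExchangeSketch {
    /** Toy coordinator (hypothetical): tracks which nodes still owe a single message. */
    static final class Coordinator {
        private final Set<String> remaining;

        Coordinator(Set<String> nodeIds) {
            this.remaining = new HashSet<>(nodeIds);
        }

        /**
         * Records a single message from the given node.
         * Returns true once every node has reported, i.e. the full message could now be sent.
         */
        boolean onSingleMessage(String nodeId) {
            remaining.remove(nodeId);
            return remaining.isEmpty();
        }

        /** Nodes that have not reported yet; a non-empty set means the exchange is still blocked. */
        Set<String> remaining() {
            return Collections.unmodifiableSet(new HashSet<>(remaining));
        }
    }

    public static void main(String[] args) {
        // Hypothetical node IDs, for illustration only.
        Coordinator crd = new Coordinator(new HashSet<>(Arrays.asList("node-a", "node-b")));

        System.out.println(crd.onSingleMessage("node-a")); // false: node-b has not reported
        System.out.println(crd.remaining());               // [node-b]
        System.out.println(crd.onSingleMessage("node-b")); // true: all single messages collected
    }
}
```

In this model, a node whose single message never arrives keeps the exchange open indefinitely, which matches the repeated "Failed to wait for partition map exchange" warnings.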
Judging by the piece of log that you provided, the node with ID=ba6aba6c didn't send the SingleMessage to the coordinator. This may mean that the communication SPI isn't working properly there. Make sure that the ports required by the communication SPI are available; usually these are 47100..47200.
The joining node may also be stuck on something. Look at its log to figure out what is happening there.
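One way to check the port advice above is to try a plain TCP connection to the communication port range from another host in the cluster. The following is a minimal sketch that uses only the JDK; the class name `CommPortCheck`, the default host and the 200 ms timeout are assumptions for illustration, and the range 47100..47200 is simply the one mentioned above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class CommPortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host could not be resolved.
            return false;
        }
    }

    /** Lists the ports in [firstPort, lastPort] that accept a connection; empty if the range is empty. */
    static List<Integer> openPorts(String host, int firstPort, int lastPort, int timeoutMs) {
        List<Integer> open = new ArrayList<>();
        for (int port = firstPort; port <= lastPort; port++) {
            if (isReachable(host, port, timeoutMs)) {
                open.add(port);
            }
        }
        return open;
    }

    public static void main(String[] args) {
        // Host to probe; pass the address of the suspect node as the first argument.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        List<Integer> open = openPorts(host, 47100, 47200, 200);
        System.out.println("Open communication ports on " + host + ": " + open);
    }
}
```

Each node normally listens on a single port from that range, so seeing one open port per node is expected. An empty result for a node that is supposed to be running suggests a firewall, security group, or container networking problem between the two hosts.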

Ignite cache fails after Failed to process selector key...java.io.IOException: Broken pipe exception

We are running Ignite 2.4 with 2 server nodes and about 30 client nodes. We use ZooKeeper discovery, and the nodes are deployed in a Docker Swarm environment.
After running for a while, I saw the exception below in one of the Ignite clients, and the caches no longer seem to work:
service-be - [INFO ] 2018-06-15 02:01:52.256 [grid-timeout-worker-#55] org.apache.ignite.internal.IgniteKernal -
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=5249f20c, uptime=02:49:02.178]
^-- H/N/C [hosts=34, nodes=34, CPUs=816]
^-- CPU [cur=24.2%, avg=0.27%, GC=0%]
^-- PageMemory [pages=0]
^-- Heap [used=848MB, free=17.19%, comm=1024MB]
^-- Non heap [used=241MB, free=84.12%, comm=251MB]
^-- Outbound messages queue [size=4]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=24, qSize=0]
service-be - [INFO ] 2018-06-15 02:01:52.432 [grid-nio-worker-tcp-comm-2-#59] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.11.0.7:47100, rmtAddr=/10.11.0.75:59204]
service-be - [INFO ] 2018-06-15 02:01:52.433 [grid-nio-worker-tcp-comm-2-#59] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Received incoming connection when already connected to this node, rejecting [locNode=5249f20c-456b-4b6f-ab41-f5cd5c3c05ba, rmtNode=6739c9af-42d1-4aad-ac9c-ac738ed13534]
service-be - [INFO ] 2018-06-15 02:01:52.634 [grid-nio-worker-tcp-comm-3-#60] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.11.0.7:47100, rmtAddr=/10.11.0.75:59206]
service-be - [INFO ] 2018-06-15 02:01:52.635 [grid-nio-worker-tcp-comm-3-#60] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Received incoming connection when already connected to this node, rejecting [locNode=5249f20c-456b-4b6f-ab41-f5cd5c3c05ba, rmtNode=6739c9af-42d1-4aad-ac9c-ac738ed13534]
service-be - [INFO ] 2018-06-15 02:01:52.836 [grid-nio-worker-tcp-comm-4-#61] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.11.0.7:47100, rmtAddr=/10.11.0.75:59208]
service-be - [INFO ] 2018-06-15 02:01:52.837 [grid-nio-worker-tcp-comm-4-#61] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Received incoming connection when already connected to this node, rejecting [locNode=5249f20c-456b-4b6f-ab41-f5cd5c3c05ba, rmtNode=6739c9af-42d1-4aad-ac9c-ac738ed13534]
service-be - [INFO ] 2018-06-15 02:01:53.038 [grid-nio-worker-tcp-comm-5-#62] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.11.0.7:47100, rmtAddr=/10.11.0.75:59210]
service-be - [INFO ] 2018-06-15 02:01:53.039 [grid-nio-worker-tcp-comm-5-#62] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Received incoming connection when already connected to this node, rejecting [locNode=5249f20c-456b-4b6f-ab41-f5cd5c3c05ba, rmtNode=6739c9af-42d1-4aad-ac9c-ac738ed13534]
service-be - [ERROR] 2018-06-15 02:01:53.231 [grid-nio-worker-tcp-comm-0-#57] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0, bytesRcvd=70700138, bytesSent=18478193, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-0, igniteInstanceName=null, finished=false, hashCode=30436088, interrupted=false, runner=grid-nio-worker-tcp-comm-0-#57]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=186 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=48224, resendCnt=0, rcvCnt=111504, sentCnt=48229, reserved=true, lastAck=111488, nodeLeft=false, node=TcpDiscoveryNode [id=6739c9af-42d1-4aad-ac9c-ac738ed13534, addrs=[10.11.0.74, 10.11.0.75, 127.0.0.1, 172.18.0.22], sockAddrs=[/172.18.0.22:47500, bdd554c3dc77/10.11.0.75:47500, /10.11.0.74:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1529039549468, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false, connectCnt=1, queueLimit=131072, reserveCnt=2, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=48224, resendCnt=0, rcvCnt=111504, sentCnt=48229, reserved=true, lastAck=111488, nodeLeft=false, node=TcpDiscoveryNode [id=6739c9af-42d1-4aad-ac9c-ac738ed13534, addrs=[10.11.0.74, 10.11.0.75, 127.0.0.1, 172.18.0.22], sockAddrs=[/172.18.0.22:47500, bdd554c3dc77/10.11.0.75:47500, /10.11.0.74:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1529039549468, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false, connectCnt=1, queueLimit=131072, reserveCnt=2, pairedConnections=false], super=GridNioSessionImpl [locAddr=/10.11.0.7:42970, rmtAddr=bdd554c3dc77/10.11.0.75:47100, createTime=1529039561958, closeTime=0, bytesSent=18478193, bytesRcvd=70700138, bytesSent0=0, 
bytesRcvd0=0, sndSchedTime=1529044007457, lastSndTime=1529049712225, lastRcvTime=1529049712225, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser#7a15b36, directMode=true], GridConnectionBytesVerifyFilter], accepted=false]]]
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWrite0(GridNioServer.java:1636)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWrite(GridNioServer.java:1293)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2307)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2080)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1749)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
service-be - [WARN ] 2018-06-15 02:01:53.231 [grid-nio-worker-tcp-comm-0-#57] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Broken pipe]
service-be - [INFO ] 2018-06-15 02:01:53.240 [grid-nio-worker-tcp-comm-6-#63] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.11.0.7:47100, rmtAddr=/10.11.0.75:59212]
service-be - [WARN ] 2018-06-15 02:02:03.253 [tcp-comm-worker-#1] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/172.18.0.22:47100, failureDetectionTimeout=10000]
Searching for the remote node that seems to be having connection trouble (as shown in the trace above), I also see these warnings in some of the other client nodes.
Any obvious pointers on what could be going wrong? From what I have found, one suggestion was to use IPv4, but the Docker overlay network already has enableipv6 disabled in our case, so I am not sure how much that will help.
[root@rhel743411 logs]# egrep -i "6739c9af-42d1-4aad-ac9c-ac738ed13534" *
service1-mw.log:service1-mw - [WARN ] 2018-06-16 00:27:55.884 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:26:02.991, curTime=00:27:55.876, fut=GridDhtColocatedLockFuture [threadId=39579, keys=[UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=776a8520461-6a403605-a8fd-4ed1-bd45-92e648929a2a, lockVer=GridCacheVersion [topVer=140519300, order=1529059257539, nodeOrder=6], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service1-mw.log:service1-mw - [WARN ] 2018-06-16 00:27:55.884 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:25:55.893, curTime=00:27:55.876, fut=GridDhtColocatedLockFuture [threadId=297, keys=[UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=f03a8520461-6a403605-a8fd-4ed1-bd45-92e648929a2a, lockVer=GridCacheVersion [topVer=140519300, order=1529059253553, nodeOrder=6], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service1-mw.log:service1-mw - [WARN ] 2018-06-16 00:27:55.884 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:26:51.661, curTime=00:27:55.876, fut=GridDhtColocatedLockFuture [threadId=38749, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false], UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=354b8520461-6a403605-a8fd-4ed1-bd45-92e648929a2a, lockVer=GridCacheVersion [topVer=140519300, order=1529059268380, nodeOrder=6], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service1-mw.log:service1-mw - [WARN ] 2018-06-16 00:27:55.885 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:26:51.772, curTime=00:27:55.876, fut=GridDhtColocatedLockFuture [threadId=343, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false]], futId=125b8520461-6a403605-a8fd-4ed1-bd45-92e648929a2a, lockVer=GridCacheVersion [topVer=140519300, order=1529059268816, nodeOrder=6], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:01:10.227 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=23:59:12.637, curTime=00:01:10.221, fut=GridDhtColocatedLockFuture [threadId=21129, keys=[UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=f5216120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529058842000, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:09:10.242 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:07:30.520, curTime=00:09:10.239, fut=GridDhtColocatedLockFuture [threadId=21304, keys=[UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=42176120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529058982457, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:13:10.269 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:11:32.462, curTime=00:13:10.268, fut=GridDhtColocatedLockFuture [threadId=21368, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false]], futId=c0f96120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059041395, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:15:10.281 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:13:43.800, curTime=00:15:10.279, fut=GridDhtColocatedLockFuture [threadId=172, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false], UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=49ab6120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059079186, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:17:10.289 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:15:44.860, curTime=00:17:10.287, fut=GridDhtColocatedLockFuture [threadId=172, keys=[UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=a3ec6120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059106786, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:20:10.299 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:18:51.741, curTime=00:20:10.298, fut=GridDhtColocatedLockFuture [threadId=172, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false], UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=8ace6120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059136637, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:21:10.308 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:19:19.018, curTime=00:21:10.304, fut=GridDhtColocatedLockFuture [threadId=21484, keys=[UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=bd7f6120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059155514, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:24:10.326 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:23:03.860, curTime=00:24:10.323, fut=GridDhtColocatedLockFuture [threadId=21544, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false]], futId=f3e17120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059200701, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:24:10.326 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:22:52.783, curTime=00:24:10.323, fut=GridDhtColocatedLockFuture [threadId=172, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false], UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=edc17120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059199113, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:26:10.330 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:24:59.321, curTime=00:26:10.328, fut=GridDhtColocatedLockFuture [threadId=172, keys=[UserKeyCacheObjectImpl [part=7, val=7, hasValBytes=false]], futId=74737120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059232146, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]
service2-mw.log:service2y-mw - [WARN ] 2018-06-16 00:29:10.349 [grid-timeout-worker-#55] org.apache.ignite.internal.diagnostic - Found long running cache future [startTime=00:27:32.480, curTime=00:29:10.347, fut=GridDhtColocatedLockFuture [threadId=21621, keys=[UserKeyCacheObjectImpl [part=8, val=8, hasValBytes=false]], futId=1fe57120461-0c4dcfda-c90b-42a3-83c4-8d2f8ecb6ab1, lockVer=GridCacheVersion [topVer=140519300, order=1529059289421, nodeOrder=17], read=false, retval=true, err=null, timeout=120000, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], done=0, trackable=true, createTtl=-1, accessTtl=-1, skipStore=false, keepBinary=false, recovery=false, miniId=1, topVer=AffinityTopologyVersion [topVer=34, minorTopVer=0], innerFuts=[[node=6739c9af-42d1-4aad-ac9c-ac738ed13534, rcvRes=false, loc=false, done=false]], inTx=false, super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=Bool reducer: true, initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=[false]]]]]

Ignite Cluster getting stuck when new node Join or release

I have a 3-node cluster with 20+ clients, running in a Spark context. Initially it works fine, but randomly an issue occurs whenever a new node (i.e. a client) tries to connect to the cluster: the cluster becomes inoperative. I got the following logs when it was stuck. If I explicitly restart any Ignite server, the cluster is released and works fine again. I am using Ignite 2.4.0; the same issue reproduces in Ignite 2.5.0 too.
client side Logs
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=44, minorTopVer=0], node=4d885cfd-45ed-43a2-8088-f35c9469797f]. Dumping pending objects that might be the cause:
GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=44, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=4d885cfd-45ed-43a2-8088-f35c9469797f, addrs=[0:0:0:0:0:0:0:1%lo, 10.13.10.179, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, hdn6.mstorm.com/10.13.10.179:0], discPort=0, order=44, intOrder=0, lastExchangeTime=1527651620413, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=true], done=false]
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=44, minorTopVer=0], node=4d885cfd-45ed-43a2-8088-f35c9469797f]. Dumping pending objects that might be the cause:
GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=44, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=4d885cfd-45ed-43a2-8088-f35c9469797f, addrs=[0:0:0:0:0:0:0:1%lo, 10.13.10.179, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, hdn6.mstorm.com/10.13.10.179:0], discPort=0, order=44, intOrder=0, lastExchangeTime=1527651620413, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=true], done=false]
Failed to wait for initial partition map exchange. Possible reasons are:
^-- Transactions in deadlock.
^-- Long running transactions (ignore if this is the case).
^-- Unreleased explicit locks.
Still waiting for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=4d885cfd-45ed-43a2-8088-f35c9469797f, addrs=
Server-side logs:
Possible starvation in striped pool. Thread name: sys-stripe-0-#1 Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtTxPrepareResponse [nearEvicted=null, futId=869dd4ca361-fe7e167d-4d80-4f57-b004-13359a9f2c11, miniId=1, super=GridDistributedTxPrepareResponse [txState=null, part=-1, err=null, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=139084030, order=1527604094903, nodeOrder=1], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=0]]]]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=984, val=null, hasValBytes=true], val=BinaryObjectImpl [arr= true, ctx=false, start=0], prevVal=null, super=GridDhtAtomicAbstractUpdateRequest [onRes=false, nearNodeId=null, nearFutId=0, flags=]]]], o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout#2735c674, Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtTxPrepareRequest [nearNodeId=628e3078-17fd-4e49-b9ae-ad94ad97a2f1, futId=6576e4ca361-6e7cdac2-d5a3-4624-9ad3-b93f25546cc3, miniId=1, topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0], invalidateNearEntries={}, nearWrites=null, owned=null, nearXidVer=GridCacheVersion [topVer=139084030, order=1527604094933, nodeOrder=2], subjId=628e3078-17fd-4e49-b9ae-ad94ad97a2f1, taskNameHash=0, preloadKeys=null, super=GridDistributedTxPrepareRequest [threadId=86, concurrency=OPTIMISTIC, isolation=READ_COMMITTED, writeVer=GridCacheVersion [topVer=139084030, order=1527604094935, nodeOrder=2], timeout=0, reads=null, writes=[IgniteTxEntry [key=BinaryObjectImpl [arr= true, ctx=false, start=0], cacheId=-1755241537, txKey=null, val=[op=UPDATE, val=BinaryObjectImpl [arr= true, ctx=false, start=0]], 
prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=null, filtersPassed=false, filtersSet=false, entry=null, prepared=0, locked=false, nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]], dhtVers=null, txSize=0, plc=2, txState=null, flags=onePhase|last, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=139084030, order=1527604094933, nodeOrder=2], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=0]]]]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=2, arr=[65774,65775]]]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=1016, val=null, hasValBytes=true], parent=GridNearAtomicAbstractSingleUpdateRequest [nodeId=null, futId=49328, topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0], parent=GridNearAtomicAbstractUpdateRequest [res=null, flags=needRes]]]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, arr=[98591]]]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, arr=[114926]]]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridNearAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=1016, val=null, hasValBytes=true], parent=GridNearAtomicAbstractSingleUpdateRequest [nodeId=null, 
futId=32946, topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0], parent=GridNear
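One of the possible reasons listed in the client log above is "Unreleased explicit locks". The following is only a minimal sketch of how an explicit lock can be taken with a timeout and always released, not a diagnosis of this cluster. The class name `GuardedLock` and the method `runLocked` are hypothetical, and a plain `ReentrantLock` stands in for the real lock. Since `IgniteCache.lock(key)` returns a `java.util.concurrent.locks.Lock`, the same helper should be usable with it.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper: not part of Ignite.
final class GuardedLock {
    /**
     * Runs the action while holding the lock.
     * Returns false (without running the action) if the lock
     * could not be acquired within timeoutMs milliseconds.
     */
    static boolean runLocked(Lock lock, long timeoutMs, Runnable action)
            throws InterruptedException {
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false; // give up instead of waiting forever
        }
        try {
            action.run();
            return true;
        } finally {
            // Always release, even if the action throws.
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a lock obtained elsewhere (assumption for this sketch).
        Lock lock = new ReentrantLock();
        boolean ran = runLocked(lock, 1000, () -> System.out.println("in critical section"));
        System.out.println("ran=" + ran);
    }
}
```

The `finally` block is the important part: a lock that is acquired but never unlocked on an error path stays held until the owning node goes away.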

org.apache.ignite.IgniteCheckedException: Failed to find class with given class loader for unmarshalling

Using Ignite 2.1, I start the first node in default server mode, with peer class loading enabled, from the command line. I see the following line in the logs:
When I start the second node (using IgniteSpringBean on a Tomcat server, in client mode), I get the following error, even though peer class loading is enabled:
org.apache.ignite.IgniteCheckedException: Failed to find class with given class loader for unmarshalling (make sure same versions of all classes are available on all nodes or enable peer-class-loading) [clsLdr=sun.misc.Launcher$AppClassLoader@18b4aac2,...
Visor tells me that both the server and the client node are in the topology and both have peer class loading enabled...
Server logs:
[vagrant@tw apache-ignite-fabric-2.1.0-bin]$ ./bin/ignite.sh ./config/example-default.xml -v
Ignite Command Line Startup, ver. 2.1.0#20170720-sha1:a6ca5c8a
2017 Copyright(C) Apache Software Foundation
[13:41:51,967][INFO][main][IgniteKernal]
>>> __________ ________________
>>> / _/ ___/ |/ / _/_ __/ __/
>>> _/ // (7 7 // / / / / _/
>>> /___/\___/_/|_/___/ /_/ /___/
>>>
>>> ver. 2.1.0#20170720-sha1:a6ca5c8a
>>> 2017 Copyright(C) Apache Software Foundation
>>>
>>> Ignite documentation: http://ignite.apache.org
[13:41:51,967][INFO][main][IgniteKernal] Config URL: file:/home/vagrant/ignite/apache-ignite-fabric-2.1.0-bin/./config/example-default.xml
[13:41:51,968][INFO][main][IgniteKernal] Daemon mode: off
[13:41:51,968][INFO][main][IgniteKernal] OS: Linux 3.10.0-327.el7.x86_64 amd64
[13:41:51,968][INFO][main][IgniteKernal] OS user: vagrant
[13:41:51,968][INFO][main][IgniteKernal] PID: 8122
[13:41:51,968][INFO][main][IgniteKernal] Language runtime: Java Platform API Specification ver. 1.8
[13:41:51,968][INFO][main][IgniteKernal] VM information: Java(TM) SE Runtime Environment 1.8.0_60-b27 Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.60-b23
[13:41:51,970][INFO][main][IgniteKernal] VM total memory: 0.97GB
[13:41:51,970][INFO][main][IgniteKernal] Remote Management [restart: on, REST: on, JMX (remote: on, port: 49122, auth: off, ssl: off)]
[13:41:51,970][INFO][main][IgniteKernal] IGNITE_HOME=/home/vagrant/ignite/apache-ignite-fabric-2.1.0-bin
[13:41:51,971][INFO][main][IgniteKernal] VM arguments: [-Xms1g, -Xmx1g, -XX:+AggressiveOpts, -XX:MaxMetaspaceSize=256m, -DIGNITE_QUIET=false, -DIGNITE_SUCCESS_FILE=/home/vagrant/ignite/apache-ignite-fabric-2.1.0-bin/work/ignite_success_96df797d-5531-4b3e-b396-5f44cdc1470e, -Dcom.sun.management.jmxremote, -Dcom.sun.management.jmxremote.port=49122, -Dcom.sun.management.jmxremote.authenticate=false, -Dcom.sun.management.jmxremote.ssl=false, -DIGNITE_HOME=/home/vagrant/ignite/apache-ignite-fabric-2.1.0-bin, -DIGNITE_PROG_NAME=./bin/ignite.sh]
[13:41:51,973][INFO][main][IgniteKernal] System cache's MemoryPolicy size is configured to 40 MB. Use MemoryConfiguration.systemCacheMemorySize property to change the setting.
[13:41:51,980][INFO][main][IgniteKernal] Configured caches [in 'sysMemPlc' memoryPolicy: ['ignite-sys-cache']]
[13:41:51,980][WARNING][main][IgniteKernal] Peer class loading is enabled (disable it in production for performance and deployment consistency reasons)
[13:41:52,002][INFO][main][IgniteKernal] 3-rd party licenses can be found at: /home/vagrant/ignite/apache-ignite-fabric-2.1.0-bin/libs/licenses
[13:41:52,077][INFO][main][IgnitePluginProcessor] Configured plugins:
[13:41:52,078][INFO][main][IgnitePluginProcessor] ^-- None
[13:41:52,078][INFO][main][IgnitePluginProcessor]
[13:41:52,138][INFO][main][TcpCommunicationSpi] Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
[13:41:52,150][WARNING][main][TcpCommunicationSpi] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[13:41:52,169][WARNING][main][NoopCheckpointSpi] Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[13:41:52,196][WARNING][main][GridCollisionManager] Collision resolution is disabled (all jobs will be activated upon arrival).
[13:41:52,197][INFO][main][IgniteKernal] Security status [authentication=off, tls/ssl=off]
[13:41:52,516][INFO][main][SqlListenerProcessor] SQL connector processor has started on TCP port 10800
[13:41:52,550][INFO][main][GridTcpRestProtocol] Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11211]
[13:41:52,593][INFO][main][IgniteKernal] Non-loopback local IPs: 10.0.10.103, 10.0.2.15, fe80:0:0:0:a00:27ff:fe51:d0d8%eth0, fe80:0:0:0:a00:27ff:fee7:1d4f%eth1
[13:41:52,593][INFO][main][IgniteKernal] Enabled local MACs: 08002751D0D8, 080027E71D4F
[13:41:52,637][INFO][main][TcpDiscoverySpi] Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=2a929c01-f8a6-4b14-9857-88eaa2b58a87]
[13:41:54,030][INFO][exchange-worker-#28%null%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=0], crd=true, evt=10, node=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328114016, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], evtNode=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328114016, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], customEvt=null]
[13:41:54,042][WARNING][exchange-worker-#28%null%][IgniteCacheDatabaseSharedManager] No user-defined default MemoryPolicy found; system default of 1GB size will be used.
[13:41:54,299][INFO][exchange-worker-#28%null%][GridCacheProcessor] Started cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL]
[13:41:54,302][INFO][exchange-worker-#28%null%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=0], waitTime=0ms]
[13:41:54,333][INFO][exchange-worker-#28%null%][GridDhtPartitionsExchangeFuture] Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=0], time=0ms]
[13:41:54,347][INFO][exchange-worker-#28%null%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=0], crd=true]
[13:41:54,350][INFO][exchange-worker-#28%null%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=12, minorTopVer=0], evt=NODE_JOINED, node=2a929c01-f8a6-4b14-9857-88eaa2b58a87]
[13:41:54,450][INFO][main][IgniteKernal] Performance suggestions for grid (fix if possible)
[13:41:54,451][INFO][main][IgniteKernal] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[13:41:54,451][INFO][main][IgniteKernal] ^-- Disable grid events (remove 'includeEventTypes' from configuration)
[13:41:54,451][INFO][main][IgniteKernal] ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
[13:41:54,451][INFO][main][IgniteKernal] ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[13:41:54,451][INFO][main][IgniteKernal] ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[13:41:54,451][INFO][main][IgniteKernal] ^-- Speed up flushing of dirty pages by OS (alter vm.dirty_expire_centisecs parameter by setting to 500)
[13:41:54,451][INFO][main][IgniteKernal] ^-- Reduce pages swapping ratio (set vm.swappiness=10)
[13:41:54,451][INFO][main][IgniteKernal] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[13:41:54,451][INFO][main][IgniteKernal]
[13:41:54,451][INFO][main][IgniteKernal] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[13:41:54,451][INFO][main][IgniteKernal]
[13:41:54,459][INFO][main][IgniteKernal]
>>> +----------------------------------------------------------------------+
>>> Ignite ver. 2.1.0#20170720-sha1:a6ca5c8a97e9a4c9d73d40ce76d1504c14ba1940
>>> +----------------------------------------------------------------------+
>>> OS name: Linux 3.10.0-327.el7.x86_64 amd64
>>> CPU(s): 1
>>> Heap: 1.0GB
>>> VM name: 8122@tw.dna.com
>>> Local node [ID=2A929C01-F8A6-4B14-9857-88EAA2B58A87, order=12, clientMode=false]
>>> Local node addresses: [10.0.10.103/0:0:0:0:0:0:0:1%lo, 10.0.2.15/10.0.10.103, /10.0.2.15, /127.0.0.1]
>>> Local ports: TCP:10800 TCP:11211 TCP:47100 UDP:47400 TCP:47500
[13:41:54,462][INFO][main][GridDiscoveryManager] Topology snapshot [ver=12, servers=1, clients=0, CPUs=1, heap=1.0GB]
[13:42:54,444][INFO][grid-timeout-worker-#15%null%][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=2a929c01, name=null, uptime=00:01:00:007]
^-- H/N/C [hosts=1, nodes=1, CPUs=1]
^-- CPU [cur=2.33%, avg=1.57%, GC=0%]
^-- PageMemory [pages=200]
^-- Heap [used=107MB, free=89.12%, comm=989MB]
^-- Non heap [used=36MB, free=97.59%, comm=37MB]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=6, qSize=0]
^-- Outbound messages queue [size=0]
[13:43:46,444][INFO][disco-event-worker-#27%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=c8c42745-f838-48ea-9145-5783a6f77681, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.101, 10.0.2.15, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /10.0.10.101:0, /10.0.2.15:0], discPort=0, order=13, intOrder=8, lastExchangeTime=1505328226398, loc=false, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=true]
[13:43:46,446][INFO][disco-event-worker-#27%null%][GridDiscoveryManager] Topology snapshot [ver=13, servers=1, clients=1, CPUs=2, heap=3.0GB]
[13:43:46,448][INFO][exchange-worker-#28%null%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true, evt=10, node=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328226435, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], evtNode=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328226435, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], customEvt=null]
[13:43:46,448][INFO][exchange-worker-#28%null%][GridDhtPartitionsExchangeFuture] Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], time=0ms]
[13:43:46,449][INFO][exchange-worker-#28%null%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true]
[13:43:46,449][INFO][exchange-worker-#28%null%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=13, minorTopVer=0], evt=NODE_JOINED, node=c8c42745-f838-48ea-9145-5783a6f77681]
[13:43:47,121][INFO][grid-nio-worker-tcp-comm-0-#17%null%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/10.0.10.103:47100, rmtAddr=/10.0.10.101:54857]
[13:43:47,357][INFO][exchange-worker-#28%null%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], crd=true, evt=18, node=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328227343, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], evtNode=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328227343, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], customEvt=DynamicCacheChangeBatch [id=dcedd8c7e51-9d6cee64-90a5-4c0b-a1ed-b4c7a1697bfb, reqs=[DynamicCacheChangeRequest [cacheName=ignite-sys-atomic-cache#dna-EVENT_DELIVERY_SET, hasCfg=true, nodeId=c8c42745-f838-48ea-9145-5783a6f77681, clientStartOnly=false, stop=false, destroy=false]], exchangeActions=ExchangeActions [startCaches=[ignite-sys-atomic-cache#dna-EVENT_DELIVERY_SET], stopCaches=null, startGrps=[dna-EVENT_DELIVERY_SET], stopGrps=[], resetParts=null, stateChangeRequest=null], startCaches=false]]
[13:43:47,378][INFO][exchange-worker-#28%null%][GridCacheProcessor] Started cache [name=ignite-sys-atomic-cache#dna-EVENT_DELIVERY_SET, group=dna-EVENT_DELIVERY_SET, memoryPolicyName=default, mode=PARTITIONED, atomicity=TRANSACTIONAL]
[13:43:47,379][INFO][exchange-worker-#28%null%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], waitTime=0ms]
[13:43:47,496][INFO][exchange-worker-#28%null%][GridDhtPartitionsExchangeFuture] Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], time=0ms]
[13:43:47,512][INFO][exchange-worker-#28%null%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], crd=true]
[13:43:47,515][INFO][exchange-worker-#28%null%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=13, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, node=c8c42745-f838-48ea-9145-5783a6f77681]
[13:43:47,558][INFO][exchange-worker-#28%null%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=2], crd=true, evt=18, node=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328227557, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], evtNode=TcpDiscoveryNode [id=2a929c01-f8a6-4b14-9857-88eaa2b58a87, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.10.103, 10.0.2.15, 127.0.0.1], sockAddrs=[/10.0.10.103:47500, /10.0.2.15:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1505328227557, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], customEvt=DynamicCacheChangeBatch [id=3dedd8c7e51-9d6cee64-90a5-4c0b-a1ed-b4c7a1697bfb, reqs=[DynamicCacheChangeRequest [cacheName=datastructures_ATOMIC_PARTITIONED_0#dna-EVENT_DELIVERY_SET, hasCfg=true, nodeId=c8c42745-f838-48ea-9145-5783a6f77681, clientStartOnly=false, stop=false, destroy=false]], exchangeActions=ExchangeActions [startCaches=[datastructures_ATOMIC_PARTITIONED_0#dna-EVENT_DELIVERY_SET], stopCaches=null, startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null], startCaches=false]]
[13:43:47,597][INFO][exchange-worker-#28%null%][GridCacheProcessor] Started cache [name=datastructures_ATOMIC_PARTITIONED_0#dna-EVENT_DELIVERY_SET, group=dna-EVENT_DELIVERY_SET, memoryPolicyName=default, mode=PARTITIONED, atomicity=ATOMIC]
[13:43:47,597][INFO][exchange-worker-#28%null%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=2], waitTime=0ms]
[13:43:47,623][INFO][exchange-worker-#28%null%][GridDhtPartitionsExchangeFuture] Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=2], time=0ms]
[13:43:47,625][INFO][exchange-worker-#28%null%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=2], crd=true]
[13:43:47,626][INFO][exchange-worker-#28%null%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=13, minorTopVer=2], evt=DISCOVERY_CUSTOM_EVT, node=c8c42745-f838-48ea-9145-5783a6f77681]
[13:43:47,915][SEVERE][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Failed to unmarshal discovery custom message.
class org.apache.ignite.IgniteCheckedException: Failed to find class with given class loader for unmarshalling (make sure same versions of all classes are available on all nodes or enable peer-class-loading) [clsLdr=sun.misc.Launcher$AppClassLoader@18b4aac2, cls=scan.fragment.node.ignite.VersionedInterceptor]
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:143)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:82)
at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9733)
at org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryCustomEventMessage.message(TcpDiscoveryCustomEventMessage.java:81)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.notifyDiscoveryListener(ServerImpl.java:5436)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processCustomMessage(ServerImpl.java:5321)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2629)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2420)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6576)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2506)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.lang.ClassNotFoundException: scan.fragment.node.ignite.VersionedInterceptor
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.ignite.internal.util.IgniteUtils.forName(IgniteUtils.java:8465)
at org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.resolveClass(JdkMarshallerObjectInputStream.java:54)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at java.util.ArrayList.readObject(ArrayList.java:791)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:121)
... 12 more
Peer class loading works only with the Compute Grid [1]. It looks like your VersionedInterceptor is part of the cache configuration (an implementation of CacheInterceptor?). Such classes have to be explicitly deployed on all nodes before the cluster starts.
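To check whether a node's classpath actually contains such a class, a small preflight check can be run with the same classpath as the node. This is a minimal sketch, not part of Ignite: the class name `ClasspathPreflight` and the method `missingClasses` are hypothetical, and the only name taken from the question is `scan.fragment.node.ignite.VersionedInterceptor`. It performs the same kind of lookup that fails with `ClassNotFoundException` in the stack trace above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Ignite.
final class ClasspathPreflight {
    /** Returns the names from classNames that the given loader cannot resolve. */
    static List<String> missingClasses(ClassLoader loader, String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: resolve the class without running static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(
            ClasspathPreflight.class.getClassLoader(),
            "scan.fragment.node.ignite.VersionedInterceptor");
        if (missing.isEmpty()) {
            System.out.println("All listed classes are on this node's classpath.");
        } else {
            System.out.println("Missing on this node: " + missing);
        }
    }
}
```

If the class is reported missing on the server node, the jar containing it needs to be added to that node's classpath. For a node started with `ignite.sh`, that usually means placing the jar in the `libs` directory of the Ignite installation and restarting the node.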