org.infinispan.util.concurrent.TimeoutException: Replication timeout for "node-name" - infinispan
We have three services that have to run in a cluster, so we use Infinispan to cluster the nodes and share data between those services. After a successful restart, I sometimes get the exception below, and one of the other nodes receives a "View Changed" event reporting the first node as having left, even though all the nodes were actually running. I could not figure out the cause of this.
I am using Infinispan 8.1.3 (distributed cache) with JGroups 3.4.
org.infinispan.util.concurrent.TimeoutException: Replication timeout for sipproxy-16964
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:765)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$80(JGroupsTransport.java:599)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport$$Lambda$9/1547262581.apply(Unknown Source)
at java.util.concurrent.CompletableFuture$ThenApply.run(CompletableFuture.java:717)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2345)
at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46)
at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2017-08-22 04:44:52,902 INFO [JGroupsTransport] (ViewHandler,ISPN,transport_manager-48870) ISPN000094: Received new cluster view for channel ISPN: [transport_manager-48870|3] (2) [transport_manager-48870, mediaproxy-47178]
2017-08-22 04:44:52,949 WARN [PreferAvailabilityStrategy] (transport-thread-transport_manager-p4-t24) ISPN000313: Cache mediaProxyResponseCache lost data because of abrupt leavers [sipproxy-16964]
2017-08-22 04:44:52,951 WARN [ClusterTopologyManagerImpl] (transport-thread-transport_manager-p4-t24) ISPN000197: Error updating cluster member list
java.lang.IllegalArgumentException: There must be at least one node with a non-zero capacity factor
at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.checkCapacityFactors(DefaultConsistentHashFactory.java:57)
at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:74)
at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:26)
at org.infinispan.topology.ClusterCacheStatus.updateCurrentTopology(ClusterCacheStatus.java:431)
at org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy.onClusterViewChange(PreferAvailabilityStrategy.java:56)
at org.infinispan.topology.ClusterCacheStatus.doHandleClusterView(ClusterCacheStatus.java:337)
at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:397)
at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:314)
at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:571)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
jgroups.xml:
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
<TCP bind_addr="131.10.20.16"
bind_port="8010" port_range="10"
recv_buf_size="20000000"
send_buf_size="640000"
loopback="false"
max_bundle_size="64k"
bundler_type="old"
enable_diagnostics="true"
thread_naming_pattern="cl"
timer_type="new"
timer.min_threads="4"
timer.max_threads="30"
timer.keep_alive_time="3000"
timer.queue_max_size="100"
timer.wheel_size="200"
timer.tick_time="50"
thread_pool.enabled="true"
thread_pool.min_threads="2"
thread_pool.max_threads="30"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="100"
thread_pool.rejection_policy="discard"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="2"
oob_thread_pool.max_threads="30"
oob_thread_pool.keep_alive_time="5000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="discard"/>
<TCPPING initial_hosts="131.10.20.16[8010],131.10.20.17[8010],131.10.20.182[8010]" port_range="2"
timeout="3000" num_initial_members="3" />
<MERGE3 max_interval="30000"
min_interval="10000"/>
<FD_SOCK/>
<FD_ALL interval="3000" timeout="10000" />
<VERIFY_SUSPECT timeout="500" />
<BARRIER />
<pbcast.NAKACK use_mcast_xmit="false"
retransmit_timeout="100,300,600,1200"
discard_delivered_msgs="true" />
<UNICAST3 conn_expiry_timeout="0"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="10m"/>
<pbcast.GMS print_local_addr="true" join_timeout="5000"
max_bundling_time="30"
view_bundling="true"/>
<UFC max_credits="2M"
min_threshold="0.4"/>
<MFC max_credits="2M"
min_threshold="0.4"/>
<FRAG2 frag_size="60000" />
<pbcast.STATE_TRANSFER/>
</config>
TimeoutException says only that a response to an RPC was not received within the timeout, nothing more. That could happen when the server is under stress, but that is probably not the case here: the logs that follow say that the node was 'suspected', meaning it was probably unresponsive for more than 10 seconds (the limit in your configuration; see FD_ALL).
The first thing to do is check that server's log for errors, and also check its GC logs for any stop-the-world pauses.
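Two timeouts interact here: the Infinispan replication timeout (how long invokeRemotelyAsync waits for the RPC response) and the JGroups FD_ALL timeout above (how long a silent node survives before being suspected). If the node is merely slow rather than dead, raising both gives it room to recover. Below is a minimal sketch for the Infinispan side, assuming the cache is defined programmatically with ConfigurationBuilder (the equivalent declarative attribute is remote-timeout); the FD_ALL timeout would be raised directly in jgroups.xml. The 30-second value is only an example:

import java.util.concurrent.TimeUnit;

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class ReplicationTimeoutTuning {

    // Sketch for Infinispan 8.x: give remote calls more headroom than the default,
    // so a briefly paused node does not immediately surface as a replication timeout.
    public static Configuration distributedCacheConfig() {
        return new ConfigurationBuilder()
                .clustering()
                    .cacheMode(CacheMode.DIST_SYNC)
                    .sync().replTimeout(30, TimeUnit.SECONDS)   // replication/RPC timeout
                .build();
    }
}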
As #flavius suggested, the main cause is that one of your nodes stopped for some reason and failed to reply to an RPC.
I suggest changing the JGroups logging level so that you can see why the node was suspected (suspicion can come from either the FD_SOCK or the FD_ALL protocol) and why it was removed from the view (very likely through the VERIFY_SUSPECT protocol).
You should also check why that happened. In most cases it is caused by long GC pauses, but your VM can also be paused by the host machine for other reasons. I suggest running jHiccup on both VMs, attached as a Java agent to your process; that way you can tell whether the pause was a JVM stop-the-world event or came from the OS.
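In addition, it can help to timestamp view changes from inside the application so they can be lined up with GC or jHiccup logs. A small sketch using Infinispan's cache-manager listener API (the class name and log format are illustrative):

import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

@Listener
public class ViewChangeLogger {

    // Called whenever a new cluster view is installed; log old and new members
    // with a timestamp so the event can be correlated with GC / jHiccup output.
    @ViewChanged
    public void onViewChanged(ViewChangedEvent event) {
        System.out.printf("%tF %<tT view changed: old=%s new=%s%n",
                System.currentTimeMillis(), event.getOldMembers(), event.getNewMembers());
    }

    public static void register(EmbeddedCacheManager cacheManager) {
        cacheManager.addListener(new ViewChangeLogger());
    }
}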
Related
Scaling Apache Ignite Grid
We have a scaling Apache Ignite grid where client nodes scale up and down based on load. Data nodes are our server nodes where continuous queries run. However, this leads to unclean shutdowns of some client nodes, because we rely on SIGTERM for Ignite node shutdown. Unclean shutdown of client nodes impacts execution of the continuous queries, which start giving "Possible starvation in striped pool" warnings, ultimately leading to blocked system-critical threads. We are currently working on ways to prevent striped pool starvation and have noticed two key issues around it:
Continuous query threads trying to connect to nodes which have shut down but are still present in the topology. We are planning to reduce the timeout so that the client node is discarded from the grid earlier. Stacktrace:
Thread [name="sys-stripe-1-#2%App%", id=37, state=RUNNABLE, blockCnt=233817, waitCnt=3343945]
at sun.nio.ch.Net.poll(Native Method)
at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3781)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3635)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3375)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3180)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:2100)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:2365)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1964)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1935)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1917)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1324)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1261)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:1059)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.access$600(CacheContinuousQueryHandler.java:90)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler$2.onEntryUpdated(CacheContinuousQueryHandler.java:459)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:447)
Continuous query threads waiting for a read lock while trying to update the cache. This generally comes up after the retries for the client node connection are exhausted. Stacktrace: Possible starvation in striped pool.
Thread name: sys-stripe-12-#13%App% Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=37b43550-d3a5-4518-8745-ece5dc06b1fd, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 83=7999, 595=7688, 596=7972, 85=7494, 597=7806, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 125=7768, 637=7662, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 652=7908, 140=7478, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 180=8026, 692=7674, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 731=7790, 219=7594, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... and 104 more}]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=0e950ae5-1474-4488-9042-80dbddb2f09a, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 595=7688, 83=7999, 596=7972, 597=7806, 85=7494, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 637=7662, 125=7768, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 140=7478, 652=7908, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 692=7674, 180=8026, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 219=7594, 731=7790, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... 
and 104 more}]]]] Deadlock: false Completed: 3316358 Thread [name="sys-stripe-12-#13%App%", id=48, state=WAITING, blockCnt=106311, waitCnt=1659827] Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#5f611d9a, ownerName=exchange-worker-#71%App%, ownerId=138] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at o.a.i.i.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.readLock(GridDhtPartitionTopologyImpl.java:256) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1837) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1734) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3322) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:141) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:273) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:268) at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142) at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318) at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109) at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308) at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907) at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528) at o.a.i.i.managers.communication.GridIoManager.access$5300(GridIoManager.java:241) at o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421) at o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55) at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:565) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) Here we can see that lock is owned by "exchange-worker-#71%App%" which seems to be struck. 
In a few cases we have seen that the lock has no specific owner:
Thread [name="sys-stripe-2-#3%App%", id=43, state=WAITING, blockCnt=39097, waitCnt=394328]
Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#667500d1, ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1663)
The continuous queries run on server nodes, which are our data nodes, and we do not expect data nodes to be impacted by client nodes in this way (e.g. getting locked). Can someone advise on how we can avoid such locks, given that nodes can have unclean shutdowns?
I believe setting the IGNITE_ENABLE_FORCIBLE_NODE_KILL property to true cluster-wide could help with the matter. It streamlines the process of kicking thick client nodes out of the cluster; the main case for it is abruptly terminated client nodes.
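For illustration, a sketch of one way to enable it; the property is read per JVM, so it needs to be set on every node (or passed as a -D JVM option) before Ignite starts:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartServerNode {
    public static void main(String[] args) {
        // Must be set before Ignition.start() on each node, or passed as
        // -DIGNITE_ENABLE_FORCIBLE_NODE_KILL=true, so that a server can drop
        // an unresponsive thick client from the topology more aggressively.
        System.setProperty("IGNITE_ENABLE_FORCIBLE_NODE_KILL", "true");

        IgniteConfiguration cfg = new IgniteConfiguration();
        Ignite ignite = Ignition.start(cfg);
    }
}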
Unexpected "Internal error" exception when using spring-cloud-gcp-pubsub 3.2.1
I'm using PubSubReactiveFactory fromspring-cloud-gcp-pubsub onOpenJDK 11 Debian Linux and I've observed the following exception in our application: com.google.api.gax.rpc.InternalException: io.grpc.StatusRuntimeException: INTERNAL: http2 exception at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:110) at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66) at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67) at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132) at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270) at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038) at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808) at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:572) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:542) at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:535) at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562) at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722) at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) Caused by: io.grpc.StatusRuntimeException: INTERNAL: http2 exception at io.grpc.Status.asRuntimeException(Status.java:535) ... 
14 common frames omitted Caused by: io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:172) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:357) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1007) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams$2.process(DefaultHttp2Connection.java:968) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.decrementPendingIterations(DefaultHttp2Connection.java:1029) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.forEachActiveStream(DefaultHttp2Connection.java:984) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection.forEachActiveStream(DefaultHttp2Connection.java:209) at io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler.goingAway(NettyClientHandler.java:839) at io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler.access$200(NettyClientHandler.java:91) at io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler$2.onGoAwayReceived(NettyClientHandler.java:278) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection.goAwayReceived(DefaultHttp2Connection.java:237) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.onGoAwayRead0(DefaultHttp2ConnectionDecoder.java:217) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onGoAwayRead(DefaultHttp2ConnectionDecoder.java:583) at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onGoAwayRead(Http2InboundFrameLogger.java:119) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readGoAwayFrame(DefaultHttp2FrameReader.java:580) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:271) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159) at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41) at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173) at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378) at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438) at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) at 
io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371) at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234) at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283) at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 1 common frames omitted The "Internal" gRPC status is propagated to application code and force us to retry pull operation polluting logs with errors/warnings along the way. To add more context this is happening when Pub/Sub PullRequest API takes long time (I'm observing p99 20-30 seconds latency when this happens). Netty closes the connection with "Stream closed before write could take place" in DefaultHttp2RemoteFlowController.java:481 and status io.netty.handler.codec.http2.Http2Error.STREAM_CLOSED(0x5) Then this status code is translated to io.grpc.internal.Http2Error.INTERNAL and propagated up the stack. Has anybody experience this error and come up with a way to gracefully handle it?
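Not an authoritative fix, but one mitigation sketch, assuming the messages come from PubSubReactiveFactory.poll(subscription, pollingPeriodMs): retry transient InternalExceptions inside the reactive pipeline so they never reach application code. The subscription name and backoff values below are placeholders:

import java.time.Duration;

import com.google.api.gax.rpc.InternalException;
import com.google.cloud.spring.pubsub.reactive.PubSubReactiveFactory;
import com.google.cloud.spring.pubsub.support.AcknowledgeablePubsubMessage;
import reactor.core.publisher.Flux;
import reactor.util.retry.Retry;

public class SubscriberConfig {

    // Sketch: resubscribe with backoff when polling fails with the (usually transient)
    // INTERNAL gRPC status, instead of letting the error propagate to the application.
    public Flux<AcknowledgeablePubsubMessage> resilientPoll(PubSubReactiveFactory factory) {
        return factory.poll("my-subscription", 1000L)   // placeholder subscription and interval
                .retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofMillis(500))
                        .maxBackoff(Duration.ofSeconds(10))
                        .filter(t -> t instanceof InternalException));
    }
}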
Striped pool starvation in WAL writing causes cluster node failure
Moderate workload on 3 node ignite cluster causes one node to fail with stripped pool startvation while archiving WAL. This happens one or two times in week. I already checked all IO problems which could hang WAL rollover. But this issue still persist I am using latest ignite 2.7 as a library inside spring boot application : >>> Possible starvation in striped pool. Deadlock: false Completed: 1397 Thread [name="sys-stripe-7-#8%server.node%", id=22, state=WAITING, blockCnt=3, waitCnt=757] Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.awaitNext(FileWriteAheadLogManager.java:2871) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.access$2300(FileWriteAheadLogManager.java:2451) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1205) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:836) at o.a.i.i.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4267) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6333) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6082) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5782) at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3719) at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5900(BPlusTree.java:3613) at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1895) at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1779) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1638) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621) at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428) at o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2295) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3242) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:135) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:309) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:304) at 
o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056) at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306) at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101) at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295) at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) ERROR --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-1, blockedFor=10s] WARN --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Thread [name="sys-stripe-1-#2%server.node%", id=16, state=WAITING, blockCnt=0, waitCnt=754] Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
The failure detection feature is not configured very well in Apache Ignite 2.7 by default. You can turn it off (by setting the handler to NoOp) or set a large failureDetectionTimeout to avoid such messages (and shutdowns of nodes).
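A sketch of what that could look like on the IgniteConfiguration; the values are illustrative, and NoOpFailureHandler keeps the node alive at the cost of ignoring genuinely stuck workers:

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

public class NodeConfig {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Option 1: stop treating blocked system-critical threads as fatal.
        cfg.setFailureHandler(new NoOpFailureHandler());

        // Option 2: keep the default handler but give slow WAL rollover / IO more
        // headroom before a worker is considered blocked (values are illustrative).
        cfg.setSystemWorkerBlockedTimeout(60_000L);
        cfg.setFailureDetectionTimeout(30_000L);

        return cfg;
    }
}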
Possible bug if doing parallelstream (distributed) computation on *EMPTY* Infinispan cache?
I'm a newbie to Infinispan and have been playing with it. I think I found a bug. If a cache is empty and we have two nodes running (and the nodes must actually be on separate virtual machines -- it can't just be 2 jvm processes), and then a 2nd process on node#1 does a parallel streaming operation as follows: final List<String> max = c .entrySet() .parallelStream() .map(e -> e.getKey().substring(0,1)) .collect(() -> Collectors.toList()); You'll get the following error (the main error is I think " Invalid lambda deserialization"). NOTE: This error does NOT occur if the cache is populated with some data. I tried (reasonably hard) to trace the code, but couldn't figure out the issue, though I suspect it has something to do with deserializing an empty collector of some sort....Has anyone seen this or think this is a likely bug (as opposed to user error? I've tried many many things before posting....). Any workaround? (Or maybe it's a super easy "real" fix in the actual relevant code?) Exception in thread "main" org.infinispan.remoting.RemoteException: ISPN000217: Received exception from proteowizard-dev2-16cpu-43102(rack-id=qa-rack, machine-id=qa-machine1), see cause for remote stack trace at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:27) at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37) at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21) at org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52) at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:35) at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:52) at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1372) at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1275) at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:126) at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1420) at org.jgroups.JChannel.up(JChannel.java:816) at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:893) at org.jgroups.protocols.FRAG3.up(FRAG3.java:171) at org.jgroups.protocols.FlowControl.up(FlowControl.java:343) at org.jgroups.protocols.pbcast.GMS.up(GMS.java:873) at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240) at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1003) at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:729) at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:384) at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:600) at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:130) at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:203) at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:253) at org.jgroups.protocols.MERGE3.up(MERGE3.java:280) at org.jgroups.protocols.Discovery.up(Discovery.java:269) at org.jgroups.protocols.TP.passMessageUp(TP.java:1248) at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalStateException: 
Unexpected exception at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadResolve(JDKSpecific.java:260) at org.jboss.marshalling.reflect.SerializableClass.callReadResolve(SerializableClass.java:271) at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1396) at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:272) at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:205) at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:41) at org.infinispan.marshall.core.ExternalJBossMarshaller.objectFromObjectStream(ExternalJBossMarshaller.java:47) at org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:873) at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:697) at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:361) at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:40) at org.infinispan.stream.impl.intops.IntermediateOperationExternalizer.readObject(IntermediateOperationExternalizer.java:377) at org.infinispan.stream.impl.intops.IntermediateOperationExternalizer.readObject(IntermediateOperationExternalizer.java:92) at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:708) at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:691) at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:361) at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:40) at org.infinispan.commons.marshall.MarshallUtil.lambda$unmarshallCollection$0(MarshallUtil.java:284) at org.infinispan.commons.marshall.MarshallUtil.unmarshallCollection(MarshallUtil.java:267) at org.infinispan.commons.marshall.MarshallUtil.unmarshallCollection(MarshallUtil.java:284) at org.infinispan.marshall.exts.CollectionExternalizer.readObject(CollectionExternalizer.java:120) at org.infinispan.marshall.exts.CollectionExternalizer.readObject(CollectionExternalizer.java:27) at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:708) at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:691) at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:361) at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:40) at org.infinispan.stream.impl.termop.TerminalOperationExternalizer.readObject(TerminalOperationExternalizer.java:192) at org.infinispan.stream.impl.termop.TerminalOperationExternalizer.readObject(TerminalOperationExternalizer.java:42) at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:708) at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:691) at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:361) at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:40) at org.infinispan.stream.impl.StreamRequestCommand.readFrom(StreamRequestCommand.java:143) at org.infinispan.marshall.exts.ReplicableCommandExternalizer.readCommandParameters(ReplicableCommandExternalizer.java:104) at org.infinispan.marshall.exts.CacheRpcCommandExternalizer.readObject(CacheRpcCommandExternalizer.java:132) at 
org.infinispan.marshall.exts.CacheRpcCommandExternalizer.readObject(CacheRpcCommandExternalizer.java:66) at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:708) at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:691) at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:361) at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:194) at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:223) at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processRequest(JGroupsTransport.java:1332) at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1272) at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:126) at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1420) at org.jgroups.JChannel.up(JChannel.java:816) at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:893) at org.jgroups.protocols.FRAG3.up(FRAG3.java:171) at org.jgroups.protocols.FlowControl.up(FlowControl.java:351) ... 16 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadResolve(JDKSpecific.java:250) ... 64 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at com.deepdia.deepsearch.modules.infinispan.StartClient.$deserializeLambda$(StartClient.java:32) ... 
74 more Caused by: an exception which occurred: in object of type java.lang.invoke.SerializedLambda -> classloader hierarchy: (Other details: Infinispan 9.4.1 embedded; OS --> Win Server 2016; oracle jdk8; the windows machines are running as VMs on Google Compute, but NOT kubernetes; jgroups was configured to use TCPPing with hardcoded hostnames) Also, although I don't think it's pertinent, here's the cache definition: Configuration conf = new ConfigurationBuilder() .memory() .evictionStrategy(EvictionStrategy.REMOVE) //I think this is default .size(SIZE) .unsafe() .unreliableReturnValues(false) .clustering() .cacheMode(CacheMode.DIST_ASYNC) .hash() .numOwners(1) .numSegments(100) .capacityFactor(capacityFactor) .persistence() .passivation(true) .addStore(RocksDBStoreConfigurationBuilder.class) .location("C:\\Users\\gsaxena888\\Downloads\\temp" + vmNum + "\\data") .expiredLocation("C:\\Users\\gsaxena888\\Downloads\\temp\\" + vmNum + "\\expired") .segmented(true) .shared(false) .async() .enabled(true) .threadPoolSize(1) .modificationQueueSize(SIZE) .build(); String newCacheName = "distributedWithL1"; manager.defineConfiguration(newCacheName, conf); Cache</*StringHolderWorking*/String, /*StringHolderWorking*/String> c = manager.getCache(newCacheName); And, although I don't think this either is pertinent, here's the ggroups xml file: <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd"> <TCP bind_addr="match-address:10.*" bind_port="${jgroups.tcp.port:7800}" enable_diagnostics="false" thread_naming_pattern="pl" send_buf_size="640k" sock_conn_timeout="300" bundler_type="no-bundler" thread_pool.min_threads="${jgroups.thread_pool.min_threads:0}" thread_pool.max_threads="${jgroups.thread_pool.max_threads:200}" thread_pool.keep_alive_time="60000" /> <TCPPING initial_hosts="10.240.0.27[7800],10.240.0.9[7800]" /> <MERGE3 min_interval="10000" max_interval="30000" /> <FD_SOCK /> <!-- Suspect node `timeout` to `timeout + timeout_check_interval` millis after the last heartbeat --> <FD_ALL timeout="10000" interval="2000" timeout_check_interval="1000" /> <VERIFY_SUSPECT timeout="1000"/> <pbcast.NAKACK2 use_mcast_xmit="false" xmit_interval="100" xmit_table_num_rows="50" xmit_table_msgs_per_row="1024" xmit_table_max_compaction_time="30000" resend_last_seqno="true" /> <UNICAST3 xmit_interval="100" xmit_table_num_rows="50" xmit_table_msgs_per_row="1024" xmit_table_max_compaction_time="30000" /> <pbcast.STABLE stability_delay="500" desired_avg_gossip="5000" max_bytes="1M" /> <pbcast.GMS print_local_addr="false" join_timeout="${jgroups.join_timeout:5000}" /> <MFC max_credits="2m" min_threshold="0.40" /> <FRAG3/> </config>
JobTracker - High memory and native thread usage
We are running hadoop on GCE with HDFS default file system, and data input/output from/to GCS. Hadoop version: 1.2.1 Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1 Observed behavior: JT will accumulate threads in waiting state, leading to OOM: 2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371) at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275) at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444) at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860) at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706) at org.apache.hadoop.mapred.JobTracker.initJob(Jobenter code hereTracker.java:3890) at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) After looking through the JT logs I found these warnings: 2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010 java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150) at org.apache.hadoop.ipc.Client.call(Client.java:1118) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392) at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783) at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987) Caused by: java.io.IOException: Couldn't set up IO streams at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642) at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249) at org.apache.hadoop.ipc.Client.call(Client.java:1093) ... 9 more Caused by: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635) ... 12 more This appears to be similar to hadoop bug reporter here: https://issues.apache.org/jira/browse/MAPREDUCE-5606 I tried proposed solution by disabling saving job logs into the output path and it solved the problem at the expense of missing logs :) I also ran jstack on JT and it showed hundreds of WAITING or TIMED_WAITING threads as such: pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x000000074d86ba60> (a java.io.PipedInputStream) at java.io.PipedInputStream.read(PipedInputStream.java:327) - locked <0x000000074d86ba60> (a java.io.PipedInputStream) at java.io.PipedInputStream.read(PipedInputStream.java:378) - locked <0x000000074d86ba60> (a java.io.PipedInputStream) at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181) at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentReque st(MediaHttpUploader.java:629) at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader. java:409) at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr actGoogleClientRequest.java:419) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr actGoogleClientRequest.java:343) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogl eClientRequest.java:460) at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGo ogleAsyncWriteChannel.java:354) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker) It appears JT is having hard time keeping up communicating with GCS via GCS Connector. Please advise, Thank you
At the moment, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it's closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases (such as in individual Hadoop tasks), there's only ever one long-lived output stream, and possibly a few shorter-lived ones for writing small metadata/marker files, etc. In general, there are two possible causes for the OOM you're running into:
You have lots of queued-up jobs; every submitted job holds an unclosed OutputStream and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause.
Something is causing a "leak" of the PrintWriter objects originally created in logSubmitted and added to fileManager. Typically, terminal events (like logFinished) will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory there may be bugs here or there which cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this assertion, it seems that an IOException while trying to do something like logMetaInfo will "removeWriter" without closing it.
I've verified that, at least under normal circumstances, the OutputStreams seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.
TL;DR: there are some working theories as to why some resource may leak and ultimately prevent the necessary threads from being created. In the meantime, you should consider changing hadoop.job.history.user.location to some HDFS location, as a way to preserve the job logs in the absence of placing them on GCS.
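For reference, the suggested workaround is a single configuration property; a sketch of setting it programmatically on the job configuration (the HDFS path is a placeholder, and the same key can equally be put into mapred-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class JobHistoryLocation {
    public static JobConf withHdfsJobHistory() {
        JobConf conf = new JobConf();
        // Keep per-job history files on HDFS instead of the GCS output path,
        // so each submitted job no longer holds an open GCS output stream
        // (and its dedicated upload thread) on the JobTracker.
        conf.set("hadoop.job.history.user.location", "hdfs:///user/hadoop/job-history");
        return conf;
    }
}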