Ignite: possible starvation in striped pool with IgniteCache
Hi there,
I get an error when using the Ignite cache.
My system elects a master node via ZooKeeper and has many slave nodes. The master processes expired values from the Ignite cache and puts them into an Ignite queue. The slave nodes feed data into the Ignite cache using streamer.addData(k, v) and consume the Ignite queue.
My code is:
Ignite cache and streamer:
// use ZooKeeper IpFinder
ignite = Ignition.getOrStart(igniteConfiguration);

// expire entries one minute after creation (set before the cache is created)
cacheConfiguration.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_MINUTE));
igniteCache = ignite.getOrCreateCache(cacheConfiguration);

// onCacheExpired: on the master, resolve the expired entry and put it into the igniteQueue
igniteCache.registerCacheEntryListener(new MutableCacheEntryListenerConfiguration<>(
    (Factory<CacheEntryListener<K, CountValue>>) () ->
        (CacheEntryExpiredListener<K, CountValue>) this::onCacheExpired,
    null, true, true));

igniteDataStreamer = ignite.dataStreamer(igniteCache.getName());
igniteDataStreamer.deployClass(BaseIgniteStreamCount.class);
igniteDataStreamer.allowOverwrite(true);
igniteDataStreamer.receiver(StreamTransformer.from((CacheEntryProcessor<K, CountValue, Object>) (e, arg) -> {
    // process the value.
    return null;
}));
The master processes the entries expired from the cache and puts them into an Ignite queue:
CollectionConfiguration collectionConfiguration = new CollectionConfiguration().setCollocated(true);
queue = ignite.queue(igniteQueueName, 0, collectionConfiguration);
The slaves consume the queue.
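For context, a minimal sketch of what the slave-side consumer loop might look like (the queue name, configuration, and element type follow the snippets above; the loop itself is an assumption):

// Hypothetical consumer loop on a slave node.
IgniteQueue<Object> expiredQueue = ignite.queue(igniteQueueName, 0, collectionConfiguration);
while (!Thread.currentThread().isInterrupted()) {
    Object expired = expiredQueue.take();   // blocks until the master enqueues an expired entry
    // ... handle the expired entry
}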
But after running for a few hours I got the error log below:
2017-09-14 17:06:45,256 org.apache.ignite.logger.java.JavaLogger warning
WARNING: >>> Possible starvation in striped pool.
Thread name: sys-stripe-6-#7%ignite%
Queue: []
Deadlock: false
Completed: 77168
Thread [name="sys-stripe-6-#7%ignite%", id=134, state=WAITING, blockCnt=0, waitCnt=68842]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:176)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:139)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:935)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:850)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.access$700(CacheContinuousQueryHandler.java:82)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler$1.onEntryUpdated(CacheContinuousQueryHandler.java:413)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryExpired(CacheContinuousQueryManager.java:429)
at o.a.i.i.processors.cache.GridCacheMapEntry.onExpired(GridCacheMapEntry.java:3046)
at o.a.i.i.processors.cache.GridCacheMapEntry.onTtlExpired(GridCacheMapEntry.java:2961)
at o.a.i.i.processors.cache.GridCacheTtlManager$1.applyx(GridCacheTtlManager.java:61)
at o.a.i.i.processors.cache.GridCacheTtlManager$1.applyx(GridCacheTtlManager.java:52)
at o.a.i.i.util.lang.IgniteInClosure2X.apply(IgniteInClosure2X.java:38)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.expire(IgniteCacheOffheapManagerImpl.java:1007)
at o.a.i.i.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:198)
at o.a.i.i.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:160)
at o.a.i.i.processors.cache.GridCacheUtils.unwindEvicts(GridCacheUtils.java:854)
at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1073)
at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:561)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:378)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:304)
at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:99)
at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:293)
at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:126)
at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1097)
at o.a.i.i.util.StripedExecutor$Stripe.run(StripedExecutor.java:483)
at java.lang.Thread.run(Thread.java:745)
The striped pool is responsible for message processing. This warning tells you that no progress is happening on some of the stripes. It may happen due to a bad network connection or when you put massive objects into a cache or a queue.
You may find more information about it in these threads:
http://apache-ignite-users.70518.x6.nabble.com/Possible-starvation-in-striped-pool-td14892.html
http://apache-ignite-users.70518.x6.nabble.com/Possible-starvation-in-striped-pool-message-td15993.html
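One pattern that can help in setups like the question's, where the expiry listener puts the expired entry into an IgniteQueue, is to keep that distributed (and potentially blocking) queue.put(...) off the Ignite callback thread and run it on an application-owned executor. A minimal sketch, assuming the onCacheExpired callback and queue field from the question (the executor, and the idea that the queue holds the expired keys, are assumptions, not the asker's actual code):

// Sketch only: hand the blocking queue.put(...) off to our own thread pool.
private final ExecutorService expiryExecutor = Executors.newSingleThreadExecutor();

private void onCacheExpired(Iterable<CacheEntryEvent<? extends K, ? extends CountValue>> events) {
    for (CacheEntryEvent<? extends K, ? extends CountValue> event : events) {
        K key = event.getKey();
        // queue.put(...) is a distributed operation and may block; don't do it on the Ignite callback thread.
        expiryExecutor.submit(() -> queue.put(key));
    }
}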
Related
Scaling Apache Ignite Grid
We have a scaling Apache Ignite grid where client nodes scale up and down based on load. Data nodes are our server nodes where continuous queries run. However, this leads to unclean shutdown of some client nodes, as we rely on SIGTERM for Ignite node shutdown. Unclean shutdown of client nodes impacts execution of the continuous queries, which start giving "Possible starvation in striped pool" warnings, ultimately leading to blocked system-critical threads. We are currently working on ways to prevent striped pool starvation and have noticed two key issues around it:
1. Continuous query threads trying to connect to nodes which have shut down but are still present in the topology. We are planning to reduce the timeout so that a client node is discarded from the grid earlier. Stacktrace:
Thread [name="sys-stripe-1-#2%App%", id=37, state=RUNNABLE, blockCnt=233817, waitCnt=3343945] at sun.nio.ch.Net.poll(Native Method) at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954) at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3781) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3635) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3375) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3180) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960) at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:2100) at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:2365) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1964) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1935) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1917) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1324) at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1261) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:1059) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.access$600(CacheContinuousQueryHandler.java:90) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler$2.onEntryUpdated(CacheContinuousQueryHandler.java:459) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:447)
2. Continuous query threads waiting for a read lock while trying to update the cache. This generally comes up after the retries for the client node connection are exhausted. Stacktrace:
Possible starvation in striped pool.
Thread name: sys-stripe-12-#13%App% Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=37b43550-d3a5-4518-8745-ece5dc06b1fd, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 83=7999, 595=7688, 596=7972, 85=7494, 597=7806, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 125=7768, 637=7662, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 652=7908, 140=7478, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 180=8026, 692=7674, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 731=7790, 219=7594, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... and 104 more}]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=0e950ae5-1474-4488-9042-80dbddb2f09a, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 595=7688, 83=7999, 596=7972, 597=7806, 85=7494, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 637=7662, 125=7768, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 140=7478, 652=7908, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 692=7674, 180=8026, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 219=7594, 731=7790, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... 
and 104 more}]]]] Deadlock: false Completed: 3316358 Thread [name="sys-stripe-12-#13%App%", id=48, state=WAITING, blockCnt=106311, waitCnt=1659827] Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#5f611d9a, ownerName=exchange-worker-#71%App%, ownerId=138] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at o.a.i.i.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.readLock(GridDhtPartitionTopologyImpl.java:256) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1837) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1734) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3322) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:141) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:273) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:268) at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142) at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318) at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109) at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308) at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907) at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528) at o.a.i.i.managers.communication.GridIoManager.access$5300(GridIoManager.java:241) at o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421) at o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55) at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:565) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) Here we can see that the lock is owned by "exchange-worker-#71%App%", which seems to be stuck.
In a few cases we have seen that the lock has no specific owner:
Thread [name="sys-stripe-2-#3%App%", id=43, state=WAITING, blockCnt=39097, waitCnt=394328] Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#667500d1, ownerName=null, ownerId=-1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1663)
Continuous queries run on the server nodes, which are our data nodes, and we do not expect data nodes to be impacted by client nodes in this way (e.g. getting locked). Can someone advise on how we can avoid such locks, given that nodes can have unclean shutdowns?
I believe setting the IGNITE_ENABLE_FORCIBLE_NODE_KILL property to true cluster-wide could help with the matter. It streamlines the process of kicking thick client nodes off a cluster; the main case for it is abruptly terminated client nodes.
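For reference, a sketch of enabling it, either as a JVM argument on every node or programmatically before the node starts (the igniteConfiguration variable is assumed from your own setup):

// Option 1: JVM argument on each node
//   -DIGNITE_ENABLE_FORCIBLE_NODE_KILL=true

// Option 2: set the system property before starting the node
System.setProperty(IgniteSystemProperties.IGNITE_ENABLE_FORCIBLE_NODE_KILL, "true");
Ignite ignite = Ignition.start(igniteConfiguration);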
Facing issue "WebClientRequestException: Pending acquire queue has reached its maximum size of 1000" with Spring Reactive WebClient
I am running a load test against a microservice API, which involves calling another microservice API using Spring Reactive WebClient. I am using the Postman runner tab to test this. First, I run the load with 1500 iterations; the second microservice is called for each request and everything works as expected. But when I run the load with 5000 iterations, the second microservice is called 3500 times and 1500 calls fail with WebClientRequestException: Pending acquire queue has reached its maximum size of 1000. I am using org.springframework.web.reactive.function.client.WebClient with the default configuration; below is the code snippet.

private WebClient webClient;

@PostConstruct
public void init() {
    this.webClient = WebClient.builder()
        .defaultHeader(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
        .build();
}

What can be done to avoid this? I am using the latest spring-boot-starter-parent dependency (version 2.5.3) with spring-webflux-5.3.9.jar.

The logs:

reactor.core.Exceptions$ErrorCallbackNotImplemented: reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3 Caused by: reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3 at reactor.core.Exceptions.retryExhausted(Exceptions.java:290) at reactor.util.retry.RetryBackoffSpec.lambda$static$0(RetryBackoffSpec.java:67) at reactor.util.retry.RetryBackoffSpec.lambda$generateCompanion$4(RetryBackoffSpec.java:557) at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:375) at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerComplete(FluxConcatMap.java:296) at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onComplete(FluxConcatMap.java:885) at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1817) at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:232) at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51) at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187) at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271) at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286) at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68) at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) **Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: Pending acquire queue has reached its maximum size of 1000; nested exception is reactor.netty.internal.shaded.reactor.pool.PoolAcquirePendingLimitException: Pending acquire queue has reached its maximum size of 1000** at
org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141) Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: Error has been observed at the following site(s): |_ checkpoint ⇢ Request to POST http://172.20.0.2:3130/v1/login/mobile [DefaultWebClient] Stack trace: at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141) at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55) at reactor.core.publisher.Mono.subscribe(Mono.java:4338) at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103) at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222) at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222) at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222) at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93) at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204) at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124) at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225) at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274) at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:414) at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.onNext(FluxConcatMap.java:251) at reactor.core.publisher.EmitterProcessor.drain(EmitterProcessor.java:491) at reactor.core.publisher.EmitterProcessor.tryEmitNext(EmitterProcessor.java:299) at reactor.core.publisher.SinkManySerialized.tryEmitNext(SinkManySerialized.java:100) at reactor.core.publisher.InternalManySink.emitNext(InternalManySink.java:27) at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.onError(FluxRetryWhen.java:190) at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:189) at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onError(HttpClientConnect.java:304) at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:189) at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onError(DefaultPooledConnectionProvider.java:172) at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.fail(AbstractPool.java:444) at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.pendingOffer(SimpleDequePool.java:543) at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.doAcquire(SimpleDequePool.java:266) at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.request(AbstractPool.java:399) at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onSubscribe(DefaultPooledConnectionProvider.java:212) at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool$QueueBorrowerMono.subscribe(SimpleDequePool.java:674) at reactor.netty.resources.PooledConnectionProvider.lambda$acquire$1(PooledConnectionProvider.java:137) at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57) at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.lambda$subscribe$0(HttpClientConnect.java:268) at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57) at reactor.core.publisher.FluxRetryWhen.subscribe(FluxRetryWhen.java:77) at 
reactor.core.publisher.MonoRetryWhen.subscribeOrReturn(MonoRetryWhen.java:46) at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:57) at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:271) at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64) at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52) at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64) at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.resubscribe(FluxRetryWhen.java:216) at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onNext(FluxRetryWhen.java:269) at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerNext(FluxConcatMap.java:282) at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onNext(FluxConcatMap.java:861) at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816) at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:232) at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51) at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284) at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187) at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271) at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286) at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68) at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) **Caused by: reactor.netty.internal.shaded.reactor.pool.PoolAcquirePendingLimitException: Pending acquire queue has reached its maximum size of 1000 at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.pendingOffer(SimpleDequePool.java:543)** at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.doAcquire(SimpleDequePool.java:266) at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.request(AbstractPool.java:399) at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onSubscribe(DefaultPooledConnectionProvider.java:212) at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool$QueueBorrowerMono.subscribe(SimpleDequePool.java:674) at reactor.netty.resources.PooledConnectionProvider.lambda$acquire$1(PooledConnectionProvider.java:137) at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57) at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.lambda$subscribe$0(HttpClientConnect.java:268) at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57) at reactor.core.publisher.FluxRetryWhen.subscribe(FluxRetryWhen.java:77) at 
reactor.core.publisher.MonoRetryWhen.subscribeOrReturn(MonoRetryWhen.java:46)
WebClient needs an HTTP client library to perform requests, and by default it uses Reactor Netty.

Quote from the Reactor Netty reference docs:

By default, Reactor Netty client uses a "fixed" connection pool with 500 as the maximum number of active channels and 1000 as the maximum number of further channel acquisition attempts allowed to be kept in a pending state (for the rest of the configurations check the system properties or the builder configurations below).

This means that the implementation creates a new channel if someone tries to acquire a channel as long as less than 500 have been created and are managed by the pool. When the maximum number of channels in the pool is reached, up to 1000 new attempts to acquire a channel are delayed (pending) until a channel is returned to the pool again, and further attempts are declined with an error.

What you are seeing is that you are actively using all 500 of the connections in the connection pool and have filled up the "pending" queue with 1000 pending requests.

You have two options to solve this:

Scale vertically: increase the connection pool size and/or the acquire queue length.

ConnectionProvider connectionProvider = ConnectionProvider.builder("myConnectionPool")
    .maxConnections(<your_desired_max_connections>)
    .pendingAcquireMaxCount(<your_desired_pending_queue_size>)
    .build();

ReactorClientHttpConnector clientHttpConnector =
    new ReactorClientHttpConnector(HttpClient.create(connectionProvider));

WebClient.builder()
    .clientConnector(clientHttpConnector)
    .build();

Scale horizontally: create additional instances of your app and load balance the API calls between your instances (see the Spring reference docs).

Additional note: it's worth considering the latency of your downstream API call when calculating the size of your connection pool. A good place to start is connection_pool_size = tps * downstream_api_latency, where tps is transactions per second.
Striped pool starvation in WAL writing causes cluster node failure
A moderate workload on a 3-node Ignite cluster causes one node to fail with striped pool starvation while archiving the WAL. This happens once or twice a week. I have already checked all IO problems which could hang the WAL rollover, but the issue still persists. I am using the latest Ignite 2.7 as a library inside a Spring Boot application:
>>> Possible starvation in striped pool. Deadlock: false Completed: 1397 Thread [name="sys-stripe-7-#8%server.node%", id=22, state=WAITING, blockCnt=3, waitCnt=757] Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.awaitNext(FileWriteAheadLogManager.java:2871) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.access$2300(FileWriteAheadLogManager.java:2451) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1205) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:836) at o.a.i.i.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4267) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6333) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6082) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5782) at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3719) at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5900(BPlusTree.java:3613) at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1895) at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1779) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1638) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621) at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428) at o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2295) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3242) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:135) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:309) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:304) at
o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056) at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306) at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101) at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295) at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) ERROR --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-1, blockedFor=10s] WARN --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Thread [name="sys-stripe-1-#2%server.node%", id=16, state=WAITING, blockCnt=0, waitCnt=754] Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
The failure detection feature is not configured very well in Apache Ignite 2.7 by default. You can turn it off (by setting the failure handler to NoOp) or set a large failureDetectionTimeout to avoid such messages (and the shutdown of nodes).
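A minimal configuration sketch for both options (the timeout value is only an example, and NoOpFailureHandler disables the automatic reaction to critical failures, so use it with care):

IgniteConfiguration cfg = new IgniteConfiguration();

// Option 1: do not shut the node down automatically on critical failures.
cfg.setFailureHandler(new org.apache.ignite.failure.NoOpFailureHandler());

// Option 2: keep the default handler but tolerate longer pauses (milliseconds; example value).
cfg.setFailureDetectionTimeout(120_000);

Ignite ignite = Ignition.start(cfg);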
Using RabbitMQ as a Flink DataStream source without creating the RabbitMQ queue automatically
When I use RabbitMQ as a Flink DataStream source, I set it up just as the Flink documentation describes:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// checkpointing is required for exactly-once or at-least-once guarantees
env.enableCheckpointing(...);

final RMQConnectionConfig connectionConfig = new RMQConnectionConfig.Builder()
    .setHost("localhost")
    .setPort(5000)
    ...
    .build();

final DataStream<String> stream = env
    .addSource(new RMQSource<String>(
        connectionConfig,            // config for the RabbitMQ connection
        "queueName",                 // name of the RabbitMQ queue to consume
        true,                        // use correlation ids; can be false if only at-least-once is required
        new SimpleStringSchema()))   // deserialization schema to turn messages into Java objects
    .setParallelism(1);              // non-parallel source is only required for exactly-once

This code connects to RabbitMQ and automatically creates the queue "queueName", and that is my problem: the RabbitMQ queue already exists (I created it before), and I don't want Flink to try to create it again. Flink creates the queue without some parameters, which conflicts with the queue I created earlier. Here is the exception:

Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - inequivalent arg 'x-message-ttl' for queue 'queueName' in vhost '/': received none but current is the value '604800000' of type 'long', class-id=50, method-id=10) at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:66) at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36) at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:443) at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:263) at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:136) ... 10 more

How can I make Flink just subscribe to a RabbitMQ queue without trying to create a new one? Thank you all.
You can write your own class extending RMQSource and override the setupQueue method so that it does not create the queue.
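A rough sketch of that approach, assuming the flink-connector-rabbitmq RMQSource whose protected setupQueue() performs the queue declaration (the class name is illustrative):

// Hypothetical subclass that skips queue declaration so the existing queue is used as-is.
public class ExistingQueueRMQSource<OUT> extends RMQSource<OUT> {

    public ExistingQueueRMQSource(RMQConnectionConfig connectionConfig,
                                  String queueName,
                                  boolean usesCorrelationId,
                                  DeserializationSchema<OUT> deserializationSchema) {
        super(connectionConfig, queueName, usesCorrelationId, deserializationSchema);
    }

    @Override
    protected void setupQueue() {
        // Intentionally empty: do not declare the queue; it already exists with its own arguments.
    }
}

It can then be passed to env.addSource(...) in place of the plain RMQSource from the question.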
Apache Ignite - Distributed Queue and Executors
I am planning to use the Apache Ignite distributed queue. I am using Ignite with a Spring Boot application. On bootup I add 20 names to a queue, but since there are 3 servers in the cluster, the same 20 names get added 3 times. I want to add them to the queue only once.

Ignite ignite = Ignition.ignite();
IgniteQueue<String> queue = ignite.queue(
    "queueName", // Queue name.
    0,           // Queue capacity. 0 for unbounded queue.
    null         // Collection configuration.
);

Distributed executors will poll from the queue and run the task. The executor is expected to poll, run the task, and then add the same name back to the queue; I am trying to achieve round robin here. Only one executor should be running a given task at any point in time, even though there are multiple servers in the cluster. Any suggestions for this?
You can launch an Ignite cluster singleton service (https://apacheignite.readme.io/docs/cluster-singletons) which will fill the queue. Alternatively, you can add the data only from the coordinator node (the oldest node in the cluster): ignite.cluster().forOldest().node().isLocal()
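A minimal sketch of the cluster-singleton approach, reusing the queue name and the 20 names from the question (the service class and its deployment name are illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.configuration.CollectionConfiguration;
import org.apache.ignite.resources.IgniteInstanceResource;
import org.apache.ignite.services.Service;
import org.apache.ignite.services.ServiceContext;

// Hypothetical singleton service that seeds the queue once per cluster.
public class QueueSeederService implements Service {
    @IgniteInstanceResource
    private Ignite ignite;

    @Override
    public void init(ServiceContext ctx) { /* no-op */ }

    @Override
    public void execute(ServiceContext ctx) {
        IgniteQueue<String> queue = ignite.queue("queueName", 0, new CollectionConfiguration());
        for (int i = 1; i <= 20; i++)
            queue.add("name-" + i);
    }

    @Override
    public void cancel(ServiceContext ctx) { /* no-op */ }
}

// Deploy once from any node; Ignite keeps a single instance cluster-wide.
ignite.services().deployClusterSingleton("queueSeeder", new QueueSeederService());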
I fixed the boot-time duplicate cache loading issue this way:

final IgniteAtomicLong cacheLoadCnt = ignite.atomicLong(cacheName + "Cnt", 0, true);
if (cacheLoadCnt.get() == 0) {
    loadCache();
    cacheLoadCnt.addAndGet(1);
}