ElastiCache - Redis (cluster mode enabled) intermittently fails to write data - redis

I am using a Redis cluster of node type m5.4xlarge with 1 node, in order to cache some results. Writes to this Redis node are very frequent, and I can see that intermittently the writes to the cluster fail.
Below is the stack trace we see in the logs for the failure.
org.redisson.client.WriteRedisConnectionException: Unable to send command! Node source: NodeSource[slot=null,addr=null,redisClient=null,redirect=null,entry=org.redisson.connection.MasterSlaveEntry@6608962a], connection: [id: 0xbad70cba, L:0.0.0.0/0.0.0.0:47904], command: (EVAL), params: [local insertable = false; local value = redis.call('hget',KEYS[1], ARGV[5]); local t, val;if value ..., 8, SEARCH_CACHE, redisson__timeout__set:{SEARCH_CACHE}, redisson__idle__set:{SEARCH_CACHE}, redisson_map_cache_created:{SEARCH_CACHE}, redisson_map_cache_updated:{SEARCH_CACHE}, redisson__map_cache__last_access__set:{SEARCH_CACHE}, redisson_map_cache_removed:{SEARCH_CACHE}, {SEARCH_CACHE}:redisson_options, ...]
    at org.redisson.command.CommandAsyncService.checkWriteFuture(CommandAsyncService.java:675)
    at org.redisson.command.CommandAsyncService.access$100(CommandAsyncService.java:84)
    at org.redisson.command.CommandAsyncService$9$1.operationComplete(CommandAsyncService.java:638)
    at org.redisson.command.CommandAsyncService$9$1.operationComplete(CommandAsyncService.java:635)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:485)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1371)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
I am using Redisson client version 3.6.5.
Can someone please help me identify the issue?
Below is the configuration I have set up for the Redis cluster connection:
idleConnectionTimeout: 1000
pingTimeout: 10000
connectTimeout: 10000
timeout: 30000
retryAttempts: 3
retryInterval: 1500
reconnectionTimeout: 30000
failedAttempts: 3
subscriptionsPerConnection: 5
slaveSubscriptionConnectionMinimumIdleSize: 1
slaveSubscriptionConnectionPoolSize: 50
slaveConnectionPoolSize: 250
masterConnectionMinimumIdleSize: 5
masterConnectionPoolSize: 250
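For reference, here is a minimal sketch (not from the question) of how settings like these map onto Redisson's programmatic cluster config; the endpoint is a placeholder, and only the options whose ClusterServersConfig setters I'm confident exist are included:
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonClusterSetup {
    public static void main(String[] args) {
        Config config = new Config();
        config.useClusterServers()
                // placeholder; use your ElastiCache configuration endpoint
                .addNodeAddress("redis://<cluster-endpoint>:6379")
                .setIdleConnectionTimeout(1000)
                .setConnectTimeout(10000)
                .setTimeout(30000)
                .setRetryAttempts(3)
                .setRetryInterval(1500)
                .setSubscriptionsPerConnection(5)
                .setSlaveConnectionPoolSize(250)
                .setMasterConnectionMinimumIdleSize(5)
                .setMasterConnectionPoolSize(250);
        // (the remaining keys above, e.g. pingTimeout and failedAttempts, have
        // version-specific setters that were renamed or removed in later 3.x releases)

        RedissonClient redisson = Redisson.create(config);
        // ... e.g. redisson.getMapCache("SEARCH_CACHE")
        redisson.shutdown();
    }
}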

Related

Aerospike connect timeout works incorrectly?

I'm using Aerospike Java client v6.0.1 with the following configs in the client read policy:
clientPolicy.readPolicyDefault.connectTimeout = 1000;
clientPolicy.readPolicyDefault.socketTimeout = 30;
clientPolicy.readPolicyDefault.totalTimeout = 110;
clientPolicy.readPolicyDefault.maxRetries = 2;
clientPolicy.readPolicyDefault.sleepBetweenRetries = 0;
but I'm getting the following errors from time to time, which say that not all retries were used and a timeout occurred:
org.springframework.dao.QueryTimeoutException: Client timeout: iteration=0 connect=1000 socket=30 total=110 maxRetries=2 node=null inDoubt=false; nested exception is com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=0 connect=1000 socket=30 total=110 maxRetries=2 node=null inDoubt=false
org.springframework.dao.QueryTimeoutException: Client timeout: iteration=1 connect=1000 socket=30 total=110 maxRetries=2 node=A2 node_ip 3000 inDoubt=false; nested exception is com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=1 connect=1000 socket=30 total=110 maxRetries=2 node=A2 node_ip 3000 inDoubt=false
Does this mean that the total operation timeout also includes connecting to the Aerospike node? The Aerospike docs state that the total timeout starts after the connect timeout finishes:
If connectTimeout is greater than zero, it will be applied to creating a connection plus optional user authentication and TLS handshake. When the connect completes, socketTimeout/totalTimeout is then applied. In this case, totalTimeout starts after the connection completes. (see https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852)
99% of all my requests to Aerospike take less than 20 ms, so it doesn't make sense for me to increase the total timeout.
Originally I had a 200-300 ms connect timeout and I increased it to 1000 ms, but it didn't help much.
Transactions can sometimes time out before the transaction has started. For example, async transactions can be throttled and can sit in the delay queue for longer than totalTimeout. If this occurs, a timeout exception is generated with iteration=0.
Anytime totalTimeout is reached, the transaction is cancelled regardless of the number of retries.
If connectTimeout is used and a new connection is required for the transaction (no available connections in the pool), the connectTimeout is applied to connection creation, and the totalTimeout stopwatch does not start until the new connection is created.
If connectTimeout is used and an existing connection is available from the pool, the connectTimeout is not applicable, and the totalTimeout stopwatch starts at the beginning of the transaction.
Since most transactions are able to obtain connections from the pool, it's not surprising that increasing connectTimeout has little effect.
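To make those scopes concrete, here is a small annotated sketch of the same policy wiring; the comments simply restate the rules above against the Aerospike Java client's Policy fields:
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.Policy;

ClientPolicy clientPolicy = new ClientPolicy();
Policy readPolicy = clientPolicy.readPolicyDefault;

// Applies only when a new socket must be created (no free connection in the
// pool): connection + optional auth + TLS handshake. The totalTimeout
// stopwatch starts only after this completes.
readPolicy.connectTimeout = 1000;

// Applies per attempt once a connection is in hand.
readPolicy.socketTimeout = 30;

// Hard cap on the whole transaction; reaching it cancels the transaction
// regardless of how many retries remain.
readPolicy.totalTimeout = 110;
readPolicy.maxRetries = 2;
readPolicy.sleepBetweenRetries = 0;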

Aerospike read times out before max retries is reached

I have the following config for aerospike read policy:
clientPolicy.timeout = 200; // timeout for refreshing cluster status, shouldn't affect reads
clientPolicy.readPolicyDefault.socketTimeout = 30;
clientPolicy.readPolicyDefault.totalTimeout = 110;
clientPolicy.readPolicyDefault.maxRetries = 2;
clientPolicy.readPolicyDefault.sleepBetweenRetries = 0;
According to what I found in the Aerospike docs, this should result in 3 read attempts (1 initial + 2 retries) of at most 30 ms each, 90 ms in total, which is less than the total timeout of 110 ms.
But in application logs I see timeout exceptions after 1 retry:
org.springframework.dao.QueryTimeoutException: Client timeout: iteration=1 connect=0 socket=30 total=110 maxRetries=2 node= inDoubt=false;
nested exception is com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=1 connect=0 socket=30 total=110 maxRetries=2 node= inDoubt=false
...
Caused by: com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=1 connect=0 socket=30 total=110 maxRetries=2 node= inDoubt=false
Is there anything I'm missing? Maybe there are more actions that occur and are included in this total timeout?
It seems to me that you don't have a connection to a node. Try clientPolicy.timeout = 1000 (the default). You may have timed out trying to establish an initial connection to a node.
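A minimal sketch of that suggestion, with the host and port as placeholders:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.ClientPolicy;

ClientPolicy clientPolicy = new ClientPolicy();
clientPolicy.timeout = 1000; // back to the default initial-connection timeout

// read policy unchanged from the question
clientPolicy.readPolicyDefault.socketTimeout = 30;
clientPolicy.readPolicyDefault.totalTimeout = 110;
clientPolicy.readPolicyDefault.maxRetries = 2;
clientPolicy.readPolicyDefault.sleepBetweenRetries = 0;

AerospikeClient client = new AerospikeClient(clientPolicy, "<host>", 3000);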

Cache partition not replicated

I have 2 nodes with persistence enabled. I create a cache like so:
// all the queues across the frontier instances
CacheConfiguration cacheCfg2 = new CacheConfiguration("queues");
cacheCfg2.setBackups(backups);
cacheCfg2.setCacheMode(CacheMode.PARTITIONED);
globalQueueCache = ignite.getOrCreateCache(cacheCfg2);
where backups is a value > 1
When one of the nodes dies, I get
Exception in thread "Thread-2" javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=queues, part=2]
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryAdapter.executeScanQuery(GridCacheQueryAdapter.java:597)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$1.applyx(IgniteCacheProxyImpl.java:519)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$1.applyx(IgniteCacheProxyImpl.java:517)
at org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
at org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:3482)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:516)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:843)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.query(GatewayProtectedCacheProxy.java:418)
at crawlercommons.urlfrontier.service.ignite.IgniteService$QueueCheck.run(IgniteService.java:270)
Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=queues, part=2]
... 9 more
I expected the content to have been replicated onto the other node. Why isn't that the case?
Most likely there is a misconfiguration somewhere. Check the following:
you are not working with an existing cache (replace getOrCreateCache with createCache; see the sketch after this list)
you do not have more server nodes than the backup factor
inspect the logs for a "Detected lost partitions" message and what happened prior to it
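To illustrate the first check, a minimal sketch assuming the cache should be created fresh: createCache throws if a cache named "queues" already exists with a different (possibly backup-less) configuration, whereas getOrCreateCache silently reuses it.
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

Ignite ignite = Ignition.start();

CacheConfiguration<String, Object> cacheCfg2 = new CacheConfiguration<>("queues");
cacheCfg2.setBackups(1); // must be >= 1 for data to survive the loss of one node
cacheCfg2.setCacheMode(CacheMode.PARTITIONED);

// Fails fast instead of reusing a pre-existing cache with stale settings.
IgniteCache<String, Object> globalQueueCache = ignite.createCache(cacheCfg2);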

Facing issue "WebClientRequestException: Pending acquire queue has reached its maximum size of 1000" with Spring Reactive WebClient

I am load testing a microservice API which involves calling another microservice API using Spring Reactive WebClient. I am using the Postman runner tab to test this.
First, I run the load with 1500 iterations; the second microservice is called for each request and everything works as expected.
But when I run the load with 5000 iterations, the second microservice is called 3500 times and 1500 calls fail with
WebClientRequestException: Pending acquire queue has reached its maximum size of 1000
I am using org.springframework.web.reactive.function.client.WebClient with the default configuration; below is the code snippet.
private WebClient webClient;

@PostConstruct
public void init() {
    this.webClient = WebClient.builder()
            .defaultHeader(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
            .build();
}
What can be done to avoid this?
I am using the latest spring-boot-starter-parent dependency (version 2.5.3) with spring-webflux-5.3.9.jar.
The logs:
reactor.core.Exceptions$ErrorCallbackNotImplemented: reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3
Caused by: reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3
at reactor.core.Exceptions.retryExhausted(Exceptions.java:290)
at reactor.util.retry.RetryBackoffSpec.lambda$static$0(RetryBackoffSpec.java:67)
at reactor.util.retry.RetryBackoffSpec.lambda$generateCompanion$4(RetryBackoffSpec.java:557)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:375)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerComplete(FluxConcatMap.java:296)
at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onComplete(FluxConcatMap.java:885)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1817)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:232)
at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: Pending acquire queue has reached its maximum size of 1000; nested exception is reactor.netty.internal.shaded.reactor.pool.PoolAcquirePendingLimitException: Pending acquire queue has reached its maximum size of 1000
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Request to POST http://172.20.0.2:3130/v1/login/mobile [DefaultWebClient]
Stack trace:
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)
at reactor.core.publisher.Mono.subscribe(Mono.java:4338)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)
at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)
at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)
at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:414)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.onNext(FluxConcatMap.java:251)
at reactor.core.publisher.EmitterProcessor.drain(EmitterProcessor.java:491)
at reactor.core.publisher.EmitterProcessor.tryEmitNext(EmitterProcessor.java:299)
at reactor.core.publisher.SinkManySerialized.tryEmitNext(SinkManySerialized.java:100)
at reactor.core.publisher.InternalManySink.emitNext(InternalManySink.java:27)
at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.onError(FluxRetryWhen.java:190)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:189)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onError(HttpClientConnect.java:304)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:189)
at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onError(DefaultPooledConnectionProvider.java:172)
at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.fail(AbstractPool.java:444)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.pendingOffer(SimpleDequePool.java:543)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.doAcquire(SimpleDequePool.java:266)
at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.request(AbstractPool.java:399)
at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onSubscribe(DefaultPooledConnectionProvider.java:212)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool$QueueBorrowerMono.subscribe(SimpleDequePool.java:674)
at reactor.netty.resources.PooledConnectionProvider.lambda$acquire$1(PooledConnectionProvider.java:137)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.lambda$subscribe$0(HttpClientConnect.java:268)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.core.publisher.FluxRetryWhen.subscribe(FluxRetryWhen.java:77)
at reactor.core.publisher.MonoRetryWhen.subscribeOrReturn(MonoRetryWhen.java:46)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:57)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:271)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.resubscribe(FluxRetryWhen.java:216)
at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onNext(FluxRetryWhen.java:269)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerNext(FluxConcatMap.java:282)
at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onNext(FluxConcatMap.java:861)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:232)
at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: reactor.netty.internal.shaded.reactor.pool.PoolAcquirePendingLimitException: Pending acquire queue has reached its maximum size of 1000
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.pendingOffer(SimpleDequePool.java:543)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.doAcquire(SimpleDequePool.java:266)
at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.request(AbstractPool.java:399)
at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onSubscribe(DefaultPooledConnectionProvider.java:212)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool$QueueBorrowerMono.subscribe(SimpleDequePool.java:674)
at reactor.netty.resources.PooledConnectionProvider.lambda$acquire$1(PooledConnectionProvider.java:137)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.lambda$subscribe$0(HttpClientConnect.java:268)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.core.publisher.FluxRetryWhen.subscribe(FluxRetryWhen.java:77)
at reactor.core.publisher.MonoRetryWhen.subscribeOrReturn(MonoRetryWhen.java:46)
WebClient needs an HTTP client library to perform requests, and by default it uses Reactor Netty.
Quote from the Reactor Netty reference docs:
By default, Reactor Netty client uses a "fixed" connection pool with 500 as the maximum number of active channels and 1000 as the maximum number of further channel acquisition attempts allowed to be kept in a pending state (for the rest of the configurations check the system properties or the builder configurations below). This means that the implementation creates a new channel if someone tries to acquire a channel as long as less than 500 have been created and are managed by the pool. When the maximum number of channels in the pool is reached, up to 1000 new attempts to acquire a channel are delayed (pending) until a channel is returned to the pool again, and further attempts are declined with an error.
What you are seeing is that you are actively using all 500 of the connections in the connection pool and you have filled up the "pending" queue with 1000 pending requests.
You have 2 options to solve this:
Scale vertically
Increase the connection pool size and/or the acquire queue length:
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

ConnectionProvider connectionProvider = ConnectionProvider.builder("myConnectionPool")
        .maxConnections(<your_desired_max_connections>)
        .pendingAcquireMaxCount(<your_desired_pending_queue_size>)
        .build();
ReactorClientHttpConnector clientHttpConnector =
        new ReactorClientHttpConnector(HttpClient.create(connectionProvider));
WebClient webClient = WebClient.builder()
        .clientConnector(clientHttpConnector)
        .build();
Scale horizontally
Create additional instances of your app and load balance the API calls between your instances.
Spring reference docs
Additional note:
It's worth considering the latency of your downstream API call when calculating the size of your connection pool. A good place to start is
connection_pool_size = tps * downstream_api_latency
where tps is transactions per second and downstream_api_latency is measured in seconds.
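For example, at 1,000 transactions per second against a downstream call that averages 50 ms (0.05 s), that starting point works out to 1000 × 0.05 = 50 connections.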

YARN RM not releasing the resources

I'm running Spark with YARN as the Resource Manager (RM). I'm submitting the application with max attempts 2, i.e. spark.yarn.maxAppAttempts=2. One of the applications processes around 3 TB of data; because of memory issues, attempt 1 failed (after processing 5 tables out of 10) and attempt 2 started. Even though attempt 1 failed, YARN is not releasing its resources (executors) for attempt 2 to use. I don't understand why YARN is not releasing the resources. Below is the conf.
spark.executor.memory=30G
spark.executor.cores=5
spark.executor.instances=95
spark.yarn.executor.memoryOverhead=8G
The total number of executors available is 100, of which I'm trying to use 95; attempt 1 tries to use all 95 executors. After attempt 1 failed, attempt 2 started with only 5 executors. As I understand it, attempt 2 should start with 95 executors just like attempt 1, since attempt 1 failed and all of its resources should be available to attempt 2.
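For context, a minimal sketch of how the conf above is typically set programmatically; the app name is hypothetical and the application logic is omitted (the same values can equally be passed as --conf flags to spark-submit):
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf()
        .setAppName("three-tb-job") // hypothetical name
        .set("spark.yarn.maxAppAttempts", "2")
        .set("spark.executor.memory", "30G")
        .set("spark.executor.cores", "5")
        .set("spark.executor.instances", "95")
        .set("spark.yarn.executor.memoryOverhead", "8G");

SparkSession spark = SparkSession.builder().config(conf).getOrCreate();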