aerospike connect timeout works incorrectly? - aerospike

I'm using aerospike java client v 6.0.1 and use the following configs from client read policy:
clientPolicy.readPolicyDefault.connectTimeout = 1000;
clientPolicy.readPolicyDefault.socketTimeout = 30;
clientPolicy.readPolicyDefault.totalTimeout = 110;
clientPolicy.readPolicyDefault.maxRetries = 2;
clientPolicy.readPolicyDefault.sleepBetweenRetries = 0;
but I'm getting the following errors from time to time, which say that not all retries were used and timeout occurred:
org.springframework.dao.QueryTimeoutException: Client timeout: iteration=0 connect=1000 socket=30 total=110 maxRetries=2 node=null inDoubt=false; nested exception is com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=0 connect=1000 socket=30 total=110 maxRetries=2 node=null inDoubt=false
org.springframework.dao.QueryTimeoutException: Client timeout: iteration=1 connect=1000 socket=30 total=110 maxRetries=2 node=A2 node_ip 3000 inDoubt=false; nested exception is com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=1 connect=1000 socket=30 total=110 maxRetries=2 node=A2 node_ip 3000 inDoubt=false
Does it mean that total operation timeout also involves connect to Aerospike node? Aerospike docs state that total timeout starts after connect timeout finishes:
If connectTimeout is greater than zero, it will be applied to creating a connection plus optional user authentication and TLS handshake. When the connect completes, socketTimeout/totalTimeout is then applied. In this case, totalTimeout starts after the connection completes. see https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852
99% of all my requests to aerospike take less than 20 ms and it doesn't make sense for me to increate total timeout.
Originally I had 200-300 ms connect timeout and I increased it to 1000 ms, but it didn't help much

Transactions can sometimes timeout before the transaction has started. For example, async transactions can be throttled and can exist in the delay queue for longer than totalTimeout. If this occurs, a timeout exception is generated with iteration=0.
Anytime totalTimeout is reached, the transaction is cancelled regardless of the number of retries.
If connectTimeout is used and a new connection is required (no available connections in the pool) for the transaction, the connectTimeout is applied to connection creation and the totalTimeout stopwatch does not start until the new connection is created.
If connectTimeout is used and an existing connection is available from the pool, the connectTimeout is not applicable and the totalTimeout stopwatch starts from the beginning of the transaction.
Since most transactions are able to obtain connections from the pool, it's not surprising that increasing connectTimeout has little effect.

Related

Aerospike read times out before max retries is reached

I have the following config for aerospike read policy:
clientPolicy.timeout = 200; // timeout for refreshing cluster status, shouldn't affect reads
clientPolicy.readPolicyDefault.socketTimeout = 30;
clientPolicy.readPolicyDefault.totalTimeout = 110;
clientPolicy.readPolicyDefault.maxRetries = 2;
clientPolicy.readPolicyDefault.sleepBetweenRetries = 0;
According to what I found in Aerospike docs this should result in 3 read attempts 30ms max for each (1 initial + 2 retries), which in total is 90 ms and it is less than total timeout of 110 ms.
But in application logs I see timeout exceptions after 1 retry:
org.springframework.dao.QueryTimeoutException: Client timeout: iteration=1 connect=0 socket=30 total=110 maxRetries=2 node= inDoubt=false;
nested exception is com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=1 connect=0 socket=30 total=110 maxRetries=2 node= inDoubt=false
...
Caused by: com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=1 connect=0 socket=30 total=110 maxRetries=2 node= inDoubt=false
Is there anything I'm missing? Maybe there are more actions that occur and are included in this total timeout?
Seems like to me you don't have a connection to a node. Try with clientPolicy.timeout = 1000 (default). You may have timed out in trying to establish an initial connection to a node.

Facing issue "WebClientRequestException: Pending acquire queue has reached its maximum size of 1000" with spring reactive webClient

I am running load of a microservice API, which involves calling other microservice API using Spring Reactive Webclient. I am using Postman runner tab to test this.
Firstly, i run the load with 1500 iteration, second microservice is getting called for each request and everything is working fine as expected.
But when i run the load with 5000 iteration, second microservice is getting called for for 3500 times and 1500 calls are failing due to issue
WebClientRequestException: Pending acquire queue has reached its maximum size of 1000
Using org.springframework.web.reactive.function.client.WebClient with default configuration, below is the code snippet.
private WebClient webClient;
#PostConstruct
public void init() {
this.webClient = WebClient.builder().defaultHeader(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
.build();
}
what can be done to avoid this?
I am using latest spring-boot-starter-parent dependency (with version 2.5.3) with spring-webflux-5.3.9.jar jar.
the logs:
reactor.core.Exceptions$ErrorCallbackNotImplemented: reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3
Caused by: reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3
at reactor.core.Exceptions.retryExhausted(Exceptions.java:290)
at reactor.util.retry.RetryBackoffSpec.lambda$static$0(RetryBackoffSpec.java:67)
at reactor.util.retry.RetryBackoffSpec.lambda$generateCompanion$4(RetryBackoffSpec.java:557)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:375)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerComplete(FluxConcatMap.java:296)
at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onComplete(FluxConcatMap.java:885)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1817)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:232)
at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
**Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: Pending acquire queue has reached its maximum size of 1000; nested exception is reactor.netty.internal.shaded.reactor.pool.PoolAcquirePendingLimitException: Pending acquire queue has reached its maximum size of 1000**
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Request to POST http://172.20.0.2:3130/v1/login/mobile [DefaultWebClient]
Stack trace:
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)
at reactor.core.publisher.Mono.subscribe(Mono.java:4338)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)
at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)
at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)
at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:414)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.onNext(FluxConcatMap.java:251)
at reactor.core.publisher.EmitterProcessor.drain(EmitterProcessor.java:491)
at reactor.core.publisher.EmitterProcessor.tryEmitNext(EmitterProcessor.java:299)
at reactor.core.publisher.SinkManySerialized.tryEmitNext(SinkManySerialized.java:100)
at reactor.core.publisher.InternalManySink.emitNext(InternalManySink.java:27)
at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.onError(FluxRetryWhen.java:190)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:189)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onError(HttpClientConnect.java:304)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:189)
at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onError(DefaultPooledConnectionProvider.java:172)
at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.fail(AbstractPool.java:444)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.pendingOffer(SimpleDequePool.java:543)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.doAcquire(SimpleDequePool.java:266)
at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.request(AbstractPool.java:399)
at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onSubscribe(DefaultPooledConnectionProvider.java:212)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool$QueueBorrowerMono.subscribe(SimpleDequePool.java:674)
at reactor.netty.resources.PooledConnectionProvider.lambda$acquire$1(PooledConnectionProvider.java:137)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.lambda$subscribe$0(HttpClientConnect.java:268)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.core.publisher.FluxRetryWhen.subscribe(FluxRetryWhen.java:77)
at reactor.core.publisher.MonoRetryWhen.subscribeOrReturn(MonoRetryWhen.java:46)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:57)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:271)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.resubscribe(FluxRetryWhen.java:216)
at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onNext(FluxRetryWhen.java:269)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerNext(FluxConcatMap.java:282)
at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onNext(FluxConcatMap.java:861)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:232)
at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
**Caused by: reactor.netty.internal.shaded.reactor.pool.PoolAcquirePendingLimitException: Pending acquire queue has reached its maximum size of 1000
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.pendingOffer(SimpleDequePool.java:543)**
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.doAcquire(SimpleDequePool.java:266)
at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.request(AbstractPool.java:399)
at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onSubscribe(DefaultPooledConnectionProvider.java:212)
at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool$QueueBorrowerMono.subscribe(SimpleDequePool.java:674)
at reactor.netty.resources.PooledConnectionProvider.lambda$acquire$1(PooledConnectionProvider.java:137)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.lambda$subscribe$0(HttpClientConnect.java:268)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:57)
at reactor.core.publisher.FluxRetryWhen.subscribe(FluxRetryWhen.java:77)
at reactor.core.publisher.MonoRetryWhen.subscribeOrReturn(MonoRetryWhen.java:46)
WebClient needs an HTTP client library to perform requests with and by default it uses Reactor Netty.
Quote from the Reactor-netty reference docs
By default, Reactor Netty client uses a “fixed” connection pool with
500 as the maximum number of active channels and 1000 as the maximum
number of further channel acquisition attempts allowed to be kept in a
pending state (for the rest of the configurations check the system
properties or the builder configurations below). This means that the
implementation creates a new channel if someone tries to acquire a
channel as long as less than 500 have been created and are managed by
the pool. When the maximum number of channels in the pool is reached,
up to 1000 new attempts to acquire a channel are delayed (pending)
until a channel is returned to the pool again, and further attempts
are declined with an error.
What you are seeing is that you are actively using all 500 of the connections in the connection pool and you have filled up the "pending" queue with 1000 pending requests.
You have 2 options to solve this
Scale vertically
Increase the connection pool size and or the acquire queue length
ConnectionProvider connectionProvider = ConnectionProvider.builder("myConnectionPool")
.maxConnections(<your_desired_max_connections>)
.pendingAcquireMaxCount(<your_desired_pending_queue_size>)
.build();
ReactorClientHttpConnector clientHttpConnector = new ReactorClientHttpConnector(HttpClient.create(connectionProvider));
WebClient.builder()
.clientConnector(clientHttpConnector)
.build();
Scale horizontally
Create additional instances of your app and load balance the api calls between your instances.
Spring reference docs
Additional note:
It's worth considering the latency of your downstream api call when calculating the size of your connection pool. A good place to start is
connection_pool_size = tps * downstream_api_latency
tps (transactions per second)

TCP Connection issue Connection refused

even if I am having a huge value of ListenBackLog=10000 and MaxConnection=10000 then also I am getting "tcp error code 10061 target machine actively refused". Its not every time but when load increases then error start appearing and one service is not able to communicate with other.
Following are the other values,
ListenBacklog = 10000;
MaxBufferPoolSize = 500000;
MaxBufferSize = 2000000000;
MaxConnections = 10000;
MaxReceivedMessageSize = 2000000000;
We already have MaxConcurrentCalls,MaxConcurrentSessions and MaxConcurrentInstances set as 10000.
In normal condition its working fine but when load increases services are not able to communicate with each other.
If we observer the performance counter then Calls Outstanding is lies between 150 to 250. Services are hosted as windows service.
any suggestion or thoughts?

create a mass of connections on client side by twisted

I use twisted to do a test job for a server. I need create a lot of connections connect to the server. This is my code:
class Account(Protocol):
def connectionMade(self):
print "connection made"
def connectionLost(self, reason):
print "connection Lost. reason: ", reason
def createAccount(self, name):
self.transport.write(...)
print "create account: ", name
class AccountFactory(Factory):
def buildProtocol(self, addr):
return Account()
def accountCreate(p, i):
print "begin create"
p.createAccount(NAME_PREFIX+str(i))
def onError(err):
return 'error: ', err
c = 0
while c < 100:
accountPoint = TCP4ClientEndpoint(reactor, server_ip, port)
accountConn = accountPoint.connect(AccountFactory())
accountConn.addCallback(accountCreate, c)
accountConn.addErrback(onError)
c += 1
reactor.run()
If server and client located in same LAN, there is no problem, all of 100 "create account: xxx" will printed. But when I put server on a remote address(internet), the client only prints near 50% number of "create account: xxx". onError doesn't fire.
The log is:
2014-07-29 15:57:06+0800 [Uninitialized] connection made
2014-07-29 15:57:06+0800 [Uninitialized] begin create
2014-07-29 15:57:06+0800 [Uninitialized] create account: xxx
repeat 60 times
2014-07-29 15:57:17+0800 [Uninitialized] Stopping factory <__main__.AccountFactory instance at xxx>
repeat 40 times
Some callback failed to be calling, even the connection haven't be made. The only different is the latency between server and client.
The most interested thing is the duration between first success log and first "Stopping factory" log is exactly 20 seconds(I try this many times). But I am sure this is not caused by timeout because TCP4ClientEndpoint default timeout is 30 seconds.
And the log time stamp is also abnormal, the log time stamp is in bundle, for example: 10 logs are 2014-07-29 17:25:09, 20 logs are 2014-07-29 17:25:15. If the connection made in async manner, the time stamp should be random enough. It should not gather together: made 10 connections at time point a, made another 20 at time point a+15sec. Or log utility problem?
Revised:
I think this is bug of twisted. The reason of "Stopping" is timeout. When I run this in linux, the time duration between first log and first stopping is timeout seconds I passed into TCP4ClientEndpoint, but under windows whatever I set the timeout seconds, the duration always 21 seconds. I use socket(blocking) to do same thing instead, all is pretty good. So this should be a bug in twisted which involve timeout when making a lot of connections.
You haven't added any error handlers to your code, nor have you enabled logging so that unhandled errors will be reported anywhere.
Enable logging, either by calling twisted.python.log.startLogging or by writing your code as an ISeviceMaker plugin and running it with twistd.
And add errbacks to each Deferred in your application so you can handle failures from their associate operations.

WCF ReliableSession and Timeouts

I have a WCF service used mainly for managing documents in a repository.
I used the chunking channel sample from MS so that I could upload/download huge files.
Now I implemented reliable session with the service and I am seeing some strange behaviors.
Here are the timeout values I am using.
this.SendTimeout = new TimeSpan(0,10,0);
this.OpenTimeout = new TimeSpan(0, 1, 0);
this.CloseTimeout = new TimeSpan(0, 1, 0);
this.ReceiveTimeout = new TimeSpan(0,10, 0);
reliableBe.InactivityTimeout = new TimeSpan(0,2,0);
I have the following issues:
1. If the Service is not up & running, the clients are not get disconnected after OpenTimeout.
I tried it with my test client.
Scenario 1: Without Reliable Session:
I get the following exception:
Could not connect to net.tcp://localhost:8788/MediaManagementService/ep1. The connection attempt lasted for a time span of 00:00:00.9848790. TCP error code 10061: No connection could be made because the target machine actively refused it 127.0.0.1:8788
This is the correct behavior as I have given the OpenTimeout as 1 sec.
Scenario 2: With ReliableSession:
I get the same exception:
Could not connect to net.tcp://localhost:8788/MediaManagementService/ep1. The connection attempt lasted for a time span of 00:00:00.9692460. TCP error code 10061: No connection could be made because the target machine actively refused it 127.0.0.1:8788.
But this message comes after around 10 mintes . (I believe after SendTimeout)
So here I just have enabled the reliable session and now it looks like the OpenTimeout = SendTimeout for the client.
Is this desired behavior?
2: Issue while uploading huge files with ReliableSession:
The general rule is that you have to set a huge value for the maxReceivedMessageSize, SendTimeout and ReceiveTimeout.
But in the case of Chunking channel, the max received message size doesn't matter as the data is sent in chunks.
So I set a huge value for Send and ReceiveTimeout : say 10 hours.
Now the upload is going fine, but it has a side effect that, even if the Service is not up, it takes 10 hours to timeout the client connection due to the behavior mentioned in (1).
Please let me know your thoughts on this behavior.