Aerospike Error while receiving on FD: 11 (Resource temporarily unavailable)

We are using Aerospike in cluster mode, and when we set the log severity to debug we see the message Error while receiving on FD 43: 11 (Resource temporarily unavailable) logged at debug level. It is not logged at Error severity, though, so I am wondering: is this an error we really need to worry about, or is it something Aerospike encounters and handles automatically (and that we can ignore)? We see it continuously in the logs and are worried that some lagging is happening.
Thanks in advance.

I wouldn't necessarily worry if this is only logged at debug level and there is no noticeable impact, but I am not certain... it could simply be how epoll behaves under aggressive workloads.
Cross-posted at https://discuss.aerospike.com/t/getting-fd-resource-temporaryly-unavailable-in-cluster-mode/7030

This is an "EAGAIN" error (see http://man7.org/linux/man-pages/man3/errno.3.html):
EAGAIN Resource temporarily unavailable (may be the same
value as EWOULDBLOCK) (POSIX.1-2001).
It means Aerospike attempted a non-blocking operation and the socket could not accept it at that moment. The server will retry the operation later.
This isn't anything to be concerned about; EAGAIN and EWOULDBLOCK errors are expected with non-blocking sockets.
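To make the mechanism concrete: the Aerospike server itself is written in C, but the same non-blocking pattern can be sketched in Java NIO (the class and method names below are mine, purely for illustration). A read that would block simply reports that nothing is available yet (at the OS level, recv() returning -1 with errno set to EAGAIN/EWOULDBLOCK), and the caller retries once the selector signals that the socket is readable again.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public final class NonBlockingReadSketch {
    // Hypothetical helper: read whatever is currently available on a
    // non-blocking channel. A return of 0 is the JVM's equivalent of the
    // C-level EAGAIN/EWOULDBLOCK: no data right now, try again later.
    static int readAvailable(SocketChannel channel, ByteBuffer buffer) throws IOException {
        int n = channel.read(buffer);
        if (n == 0) {
            // Would block: nothing to do, the selector will fire when data arrives.
        } else if (n == -1) {
            // Peer closed the connection; unlike EAGAIN, this is a real end-of-stream.
        }
        return n;
    }
}
```

In other words, "Resource temporarily unavailable" is the retry case, not a failure, which is why the server only mentions it at debug level.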

Related

SO_KEEPALIVE issue in Mulesoft

We have a Mulesoft app that picks messages from a queue (ActiveMQ) and then posts them to a target app via an HTTP request to the target's API.
Runtime: 4.3.0
HTTP Connector version: v1.3.2
Server: Windows, On-premise standalone
However, sometimes the message is not sent successfully after being picked from the queue, and the message below can be found in the log:
WARN 2021-07-10 01:24:46,080 [[masked-app].http.requester.requestConfig.02 SelectorRunner] [event: ] org.glassfish.grizzly.nio.transport.TCPNIOTransport: GRIZZLY0005: Can not set SO_KEEPALIVE to false
java.net.SocketException: Invalid argument: no further information
at sun.nio.ch.Net.setIntOption0(Native Method) ~[?:1.8.0_281]
The flow completes silently without any error after the message above, so no error handling takes place.
I found a document mentioning that this is a known bug on Windows Server and won't affect the behavior of the application, but that document describes a failure to set SO_KEEPALIVE to true rather than to false.
It looks like the message was not posted successfully, as the target system team can't find a corresponding incoming request in their logs.
This is not acceptable, because the message is critical and nobody knows it was lost unless the target system realizes something is wrong... I am not sure whether the failure to set SO_KEEPALIVE to false is the root cause. Could you please share some thoughts? Thanks a lot in advance.
This is probably unrelated to the warning you mentioned, but there doesn't seem to be enough information to identify the actual root cause.
Having said that, the version of the HTTP connector is old and is missing almost 3 years of fixes. Updating to the latest version should improve the reliability of the application.
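For context, the warning corresponds to a plain socket-option call that Grizzly wraps in a try/catch, which is why the flow carries on silently afterwards. A minimal sketch in plain java.net terms (the helper method name is mine, not Grizzly's):

```java
import java.net.Socket;
import java.net.SocketException;

public final class KeepAliveSketch {
    // Hypothetical helper mirroring what the GRIZZLY0005 warning is about:
    // applying SO_KEEPALIVE to a connection and tolerating the failure.
    static void applyKeepAlive(Socket socket, boolean keepAlive) {
        try {
            socket.setKeepAlive(keepAlive);
        } catch (SocketException e) {
            // On some Windows setups this call fails with "Invalid argument".
            // Grizzly only logs a WARN and keeps using the socket, so this
            // warning by itself does not explain a lost message.
        }
    }
}
```

That is consistent with the advice above: the WARN is most likely noise, and the missing request at the target is better explained by reliability issues fixed in newer connector versions.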

JVM appears to be hung with Outofheapspace error when the response payload size is more than 3 MB in Mule 4

I am using Mule 4 to retrieve records from a database and return them in the response. All the components complete successfully, but it fails while streaming the response. When I call the endpoint from Postman I see the error:
<h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
In Studio, I get logs like:
Pinging the JVM took 9 seconds to respond.
JVM appears hung: Timed out waiting for signal from JVM. Requesting thread dump.
Dumping JVM state.
JVM appears hung: Timed out waiting for signal from JVM. Restarting JVM.
JVM exited after being requested to terminate.
JVM Restarts disabled. Shutting down.
<-- Wrapper Stopped
Could anyone help me with this?
Thanks,
Sanjukta
Some information is not being streamed. You didn't provide any details of the implementation, but clearly some component is consuming a lot of heap memory. It may not be the database; it could be some other component. Check the streaming configuration of your components.
To identify the cause locally, you can capture a heap dump and analyze it while the runtime in Studio is timing out on the ping, before it crashes. The ping timeout is probably caused by high garbage-collection activity.
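If you want to capture that heap dump programmatically rather than with jmap or VisualVM, a minimal sketch using the standard HotSpot diagnostic MBean looks like this (the output file name is just an example):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

public final class HeapDumper {
    // Writes an .hprof snapshot that can be opened in VisualVM or Eclipse MAT.
    public static void dump(String path) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true); // true = dump only live (reachable) objects
    }

    public static void main(String[] args) throws Exception {
        dump("mule-app-heap.hprof");
    }
}
```

The dominator tree in the dump should show whether the database result set or some other component is holding the whole payload in memory instead of streaming it.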
This is a symptom of your JVM heap memory being full. Check your settings in Anypoint Studio and see how much heap is allocated.
Check this article:
https://help.mulesoft.com/s/article/Out-Of-Memory-in-Studio-Application-How-to-increase-the-maximum-heap-size?r=6&ui-force-components-controllers-recordGlobalValueProvider.RecordGvp.getRecord=1

Apache Flink error checkpointing to S3

We have Apache Flink (1.4.2) running on an EMR cluster. We are checkpointing to an S3 bucket, and are pushing about 5,000 records per second through the flows. We recently saw the following error in our logs:
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@ip-XXX-XXX-XXX-XXX:XXXXXX/user/taskmanager#-XXXXXXX]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.TaskManagerMessages$RequestTaskManagerLog".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
Immediately after this we got the following in our logs:
2018-07-30 15:08:32,177 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 831 @ 1532963312177
2018-07-30 15:09:46,750 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.EOFException: Read an incomplete length
at org.apache.flink.runtime.blob.BlobUtils.readLength(BlobUtils.java:366)
at org.apache.flink.runtime.blob.BlobServerConnection.readFileFully(BlobServerConnection.java:403)
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:349)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:114)
At this point the flow crashed and was not able to recover automatically; however, we were able to restart the flow manually without needing to change the location of the S3 bucket. The fact that the crash occurred while pushing to S3 makes me think that is the crux of the problem.
Any ideas?
FYI, this was caused by too much cross-talk between nodes flooding the NICs on each server. The solution was more intelligent partitioning.
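As a rough illustration of what "more intelligent partitioning" can mean in Flink's DataStream API (the Event type and its deviceId key below are invented for the sketch, not taken from the original job), partitionCustom lets you decide which subtask each record goes to, so related records stay together instead of being sprayed across every node:

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;

public final class PartitioningSketch {

    // Hypothetical record type; stands in for whatever the job actually streams.
    public static class Event {
        public String deviceId;
        public String payload;
    }

    // Route all events for the same device to the same downstream subtask,
    // which keeps related traffic local and reduces cross-node shuffle volume.
    static DataStream<Event> partitionByDevice(DataStream<Event> events) {
        return events.partitionCustom(
                new Partitioner<String>() {
                    @Override
                    public int partition(String key, int numPartitions) {
                        return Math.abs(key.hashCode() % numPartitions);
                    }
                },
                new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event event) {
                        return event.deviceId;
                    }
                });
    }
}
```

Whether this helps depends on how the keys are distributed; the point is simply to cut the volume of records that have to cross the network between task managers.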

What does the error error:140000DB:SSL routines:SSL routines:short read mean?

In our software we keep getting this warning/error message intermittently, and we are not sure how or why it appears.
HTTP asio handshake failed: error:140000DB:SSL routines:SSL
routines:short read
I searched the Internet, but most of the results point to a VMware problem, which is not the case for me.
Eventually I found out that this error is actually raised by OpenSSL, which is used by Boost.Asio. I downloaded the source code of OpenSSL/Asio/Boost but couldn't find this error code in the source. My question: does anyone know what this error means, and what could trigger it? I just want to understand it well enough to reproduce it, so we can fix our software if there is a hole in it.
Many thanks in advance!
Reference:
http://ib-krajewski.blogspot.my/2016/03/https-support-for-casablanca-client.html
how to clean boost::asio::ssl::stream after closed by server
A commit in OpenSSL removed the error SSL_R_SHORT_READ.
The commit just before that removal still has it defined as 219 == 0xDB. That reason code of 0xDB is what comes out of OpenSSL as 0x140000DB.
In general, a short read happens on TCP when the connection ends before the other side has sent enough data to decode the current message. This can happen, for example, because the other side crashed or because of a network problem.
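If it helps to see where those numbers come from, the packed error code can be pulled apart with a few bit shifts; in the older OpenSSL layout it is 8 bits of library code, 12 bits of function code and 12 bits of reason code. A quick sketch:

```java
public final class OpenSslErrorCode {
    public static void main(String[] args) {
        int code = 0x140000DB;
        int lib    = (code >>> 24) & 0xFF;  // 0x14 = 20  -> ERR_LIB_SSL
        int func   = (code >>> 12) & 0xFFF; // 0x000      -> function not recorded
        int reason = code & 0xFFF;          // 0x0DB = 219 -> SSL_R_SHORT_READ
        System.out.printf("lib=%d func=%d reason=%d%n", lib, func, reason);
    }
}
```

That may also be why grepping a current OpenSSL tree for the code finds nothing: the reason constant was removed, while older OpenSSL builds still produce it.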
I found the root cause of my problem: there was a cipher mismatch between the host and the client trying to connect to it, and this error was then thrown on the client side.

NServiceBus exceptions logged as INFO messages

I'm running an NServiceBus endpoint on an Azure worker role and currently send all diagnostics to table storage. I was getting messages in my DLQ, and I couldn't figure out why no exceptions were being logged in my table storage.
It turns out that NSB logs the exceptions as INFO, which is why I couldn't easily spot them among all the other verbose logging.
In my case, a command handler's dependencies couldn't be resolved, so Autofac throws an exception. I totally get why the exception is thrown; I just don't understand why it's logged as INFO. The message ends up in my DLQ, and I only have an INFO trace to understand why.
Is there a reason why exceptions are handled this way in NSB?
NServiceBus is not logging the container issue as an error because it happens during an attempt to process a message. First Level Retries and Second Level Retries will be attempted; when SLR kicks in, it logs a WARN about the retry. Only when the message ultimately fails processing is an ERROR message logged. The NServiceBus and Autofac sample can be used to reproduce this.
When the endpoint runs as a scaled-out role and MaxDeliveryCount is not big enough to accommodate all the role instances and the retry count each instance would hold, DeliveryCount reaches its maximum while an NServiceBus endpoint instance still thinks it has attempts left before sending the message to the error queue and logging an error. For example, with MaxDeliveryCount set to 10 and five role instances each expecting up to 5 attempts, the broker can dead-letter the message after 10 deliveries before any single instance has used up its own retries. Similar to the question here, I'd recommend increasing MaxDeliveryCount.
There is an open NServiceBus issue to add native support for an SLR counter; you can add your voice to that issue. The next version of NServiceBus (V6) will log the message id along with the exception, so that you can at least correlate a message in the DLQ with the log file.