Cancel signal (close connection) what could be the cause? - spring-webflux

In Kibana of our application, I keep seeing this line of log from org.springframework.web.reactive.function.client.ExchangeFunctions:
[2f5e234b] Cancel signal (to close connection)
The thread is reactor-http-epoll-1 or so.
It could happen in two situations:
when the connection is successful and returns a response, then it does not matter
when for some unknown cause, after 10 seconds, the connection does not return anything, and this line also happens, and period, nothing more. It seems to be a timeout but I am not sure(because the default timeout in my WebClient config is 10s)
What could be the cause of this? Client active drop or server active refusal?
Is the 2nd case a timeout? But not TimeoutException() is thrown afterwards.
I now do a doOnCancel() logging in WebClient to deal with the 2nd case, but then I notice there is case 1, and this doOnCancel() handling does not make sense anymore, because it seems to happen in all cases.

I have the same log. But in my WebClient i returned Mono.empty() and the method signature was Mono< Void>. After changing to Mono< String> the problem was gone.

Related

Kafka streams: groupByKey and reduce not triggering action exactly once when error occurs in stream

I have a simple Kafka streams scenario where I am doing a groupyByKey then reduce and then an action. There could be duplicate events in the source topic hence the groupyByKey and reduce
The action could error and in that case, I need the streams app to reprocess that event. In the example below I'm always throwing an error to demonstrate the point.
It is very important that the action only ever happens once and at least once.
The problem I'm finding is that when the streams app reprocesses the event, the reduce function is being called and as it returns null the action doesn't get recalled.
As only one event is produced to the source topic TOPIC_NAME I would expect the reduce to not have any values and skip down to the mapValues.
val topologyBuilder = StreamsBuilder()
topologyBuilder.stream(
TOPIC_NAME,
Consumed.with(Serdes.String(), EventSerde())
)
.groupByKey(Grouped.with(Serdes.String(), EventSerde()))
.reduce { current, _ ->
println("reduce hit")
null
}
.mapValues { _, v ->
println(Id: "${v.correlationId}")
throw Exception("simulate error")
}
To cause the issue I run the streams app twice. This is the output:
First run
Id: 90e6aefb-8763-4861-8d82-1304a6b5654e
11:10:52.320 [test-app-dcea4eb1-a58f-4a30-905f-46dad446b31e-StreamThread-1] ERROR org.apache.kafka.streams.KafkaStreams - stream-client [test-app-dcea4eb1-a58f-4a30-905f-46dad446b31e] All stream threads have died. The instance will be in error state and should be closed.
Second run
reduce hit
As you can see the .mapValues doesn't get called on the second run even though it errored on the first run causing the streams app to reprocess the same event again.
Is it possible to be able to have a streams app re-process an event with a reduced step where it's treating the event like it's never seen before? - Or is there a better approach to how I'm doing this?
I was missing a property setting for the streams app.
props["processing.guarantee"]= "exactly_once"
By setting this, it will guarantee that any state created from the point of picking up the event will rollback in case of a exception being thrown and the streams app crashing.
The problem was that the streams app would pick up the event again to re-process but the reducer step had state which has persisted. By enabling the exactly_once setting it ensures that the reducer state is also rolled back.
It now successfully re-processes the event as if it had never seen it before

Stop consuming from KafkaReceiver after a timeout

I have a common rest controller:
private final KafkaReceiver<String, Domain> receiver;
#GetMapping(produces = MediaType.APPLICATION_STREAM_JSON_VALUE)
public Flux<Domain> produceFluxMessages() {
return receiver.receive().map(ConsumerRecord::value)
.timeout(Duration.ofSeconds(2));
}
What I am trying to achieve is to collect messages from Kafka topic for a certain period of time, and then just stop consuming and consider this flux completed. If I remove timeout and open this in a browser, I am getting messages forever, downloading never stops. And with this timeout consuming stops after 2 seconds, but I'm getting an exception:
java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 2000ms in 'map' (and no fallback has been configured)
Is there a way to successfully complete Flux after timeout?
There's multiple overloads of the timeout() method - you're using the standard one that throws an exception on timeout.
Instead, just use the overloaded timeout method to provide an empty default publisher to fallback to:
timeout(Duration.ofSeconds(2), Mono.empty())
(Note in a general case you could explicitly capture the TimeoutException and fallback to an empty publisher using onErrorResume(TimeoutException.class, e -> Mono.empty()), but that's much less preferable to using the above option where possible.)

Setting a timeout on webservice consumer built with org.apache.axis.client.Call and running on Domino

I'm maintaining an antedeluvian Notes application which connects to a SAP back-end via a manually done 'Webservice'
The server is running Domino Release 7.0.4FP2 HF97.
The Webservice is not the more recently Webservice Consumer, but a large Java agent which is using Apache soap.jar (org.apache.soap). Below an example of the calling code.
private Call setupSOAPCall() {
Call call = new Call();
SOAPHTTPConnection conn = new SOAPHTTPConnection();
call.setSOAPTransport(conn);
call.setEncodingStyleURI(Constants.NS_URI_SOAP_ENC);
There has been a change in the SAP system which is now taking 8 minutes to complete (verified by SAP Team).
I'm getting an error message as follows:
[SOAPException: faultCode=SOAP-ENV:Client; msg=For input string: "906 "; targetException=java.lang.NumberFormatException: For input string: "906 "]
I found a blog article describing the error message quite closely:
https://thejavablog.wordpress.com/category/jmeter/
and I've come to the hypothesis that it is a timeout message that is returning to my Call object and that this timeout message is being incorrectly parsed, hence the NumberFormat Exception.
Looking at my logs I can see that there is a time difference of 62 seconds between my call and the response.
I recommended that the server setting in the server document, tab Internet Protocols/HTTP/Timeouts/Request timeouts be changed from 60 seconds to 600 seconds, and the http task restarted with
tell http restart
I've re-run the tests and I am getting the same error, and the time difference is still slightly more than 60 seconds, which is not what I was expecting.
I read Michael Rulnau's blog entry
http://www.mruhnau.net/2014/06/how-to-overcome-domino-webservice.html
which points to this APR
http://www-01.ibm.com/support/docview.wss?uid=swg1LO48272
but I'm not convinced that this would apply in this case, since there is no way that IBM would know that my Java agent is in fact making a Soap call.
My current hypothesis is that I have to use either the setTimeout() method on
org.apache.axis.client.Call
https://axis.apache.org/axis/java/apiDocs/org/apache/axis/client/Call.html
or on the org.apache.soap.transport.http.SOAPHTTPConnection
https://docs.oracle.com/cd/B13789_01/appdev.101/b12024/org/apache/soap/transport/http/SOAPHTTPConnection.html
and that the timeout value is an apache default, not something that is controlled by the Domino server.
I'd be grateful for any help.
I understand your approach, and I hope this is the correct one to solve your problem.
Add a debug (console write would be fine) that display the default Timeout then try to increase it to 10 min.
SOAPHTTPConnection conn = new SOAPHTTPConnection();
System.out.println("time out is :" + conn.getTimeout());
conn.setTimeout(600000);//10 min in ms
System.out.println("after setting it, time out is :" + conn.getTimeout());
call.setSOAPTransport(conn);
Now keep in mind that Dommino has also a Max LotusScript/Java execution time, check this value and (at least for a try) change it: http://www.ibm.com/support/knowledgecenter/SSKTMJ_9.0.1/admin/othr_servertasksagentmanagertab_r.html (it's version 9 help but this part should be identical)
I've since discovered that it wasn't my code generating the error; the default timeout for the apache axis SOAPHTTPConnetion is 0, i.e. no timeout.

Suppressing NServicebus Transaction to write errors to database

I'm using NServiceBus to handle some calculation messages. I have a new requirement to handle calculation errors by writing them the same database. I'm using NHibernate as my DAL which auto enlists to the NServiceBus transaction and provides rollback in case of exceptions, which is working really well. However if I write this particular error to the database, it is also rolled back which is a problem.
I knew this would be a problem, but I thought I could just wrap the call in a new transaction with the TransactionScopeOption = Suppress. However the error data is still rolled back. I believe that's because it was using the existing session with has already enlisted in the NServiceBus transaction.
Next I tried opening a new session from the existing SessionFactory within the suppression transaction scope. However the first call to the database to retrieve or save data using this new session blocks and then times out.
InnerException: System.Data.SqlClient.SqlException
Message=Timeout expired. The timeout period elapsed prior to completion of the >operation or the server is not responding.
Finally I tried creating a new SessionFactory using it to open a new session within the suppression transaction scope. However again it blocks and times out.
I feel like I'm missing something obvious here, and would greatly appreciate any suggestions on this probably common task.
As Adam suggests in the comments, in most cases it is preferred to let the entire message fail processing, giving the built-in Retry mechanism a chance to get it right, and eventually going to the error queue. Then another process can monitor the error queue and do any required notification, including logging to a database.
However, there are some use cases where the entire message is not a failure, i.e. on the whole, it "succeeds" (whatever the business-dependent definition of that is) but there is some small part that is in error. For example, a financial calculation in which the processing "succeeds" but some human element of the data is "in error". In this case I would suggest catching that exception and sending a new message which, when processed by another endpoint, will log the information to your database.
I could see another case where you want the entire message to fail, but you want the fact that it was attempted noted somehow. This may be closest to what you are describing. In this case, create a new TransactionScope with TransactionScopeOption = Suppress, and then (again) send a new message inside that scope. That message will be sent whether or not your full message transaction rolls back.
You are correct that your transaction is rolling back because the NHibernate session is opened while the transaction is in force. Trying to open a new session inside the suppressed transaction can cause a problem with locking. That's why, most of the time, sending a new message asynchronously is part of the solution in these cases, but how you do it is dependent upon your specific business requirements.
I know I'm late to the party, but as an alternative suggestion, you coudl simply raise another separate log message, which NSB handles independently, for example:
public void Handle(DebitAccountMessage message)
{
var account = this.dbcontext.GetById(message.Id);
if (account.Balance <= 0)
{
// log request - new handler
this.Bus.Send(new DebitAccountLogMessage
{
originalMessage = message,
account = account,
timeStamp = DateTime.UtcNow
});
// throw error - NSB will handle
throw new DebitException("Not enough funds");
}
}
public void Handle(DebitAccountLogMessage message)
{
var messageString = message.originalMessage.Dump();
var accountString = message.account.Dump(DumpOptions.SuppressSecurityTokens);
this.Logger.Log(message.UniqueId, string.Format("{0}, {1}", messageString, accountString);
}

How to diagnose "the operation has timed out" HttpException

I am calling 5 external servers to retrieve XML-based data for each request for a particular webpage on my IIS 6 server. Present volume is between 3-5 incoming requests per second, meaning 15-20 outgoing requests per second.
99% of the outgoing requests from my server (the client) to the external servers (the server) work OK but about 100-200 per day end up with a "The operation has timed out" exception.
This suggests I have a resource problem on my server - some shortage of sockets, ports etc or a thread lock but the problem with this theory is that the failures are entirely random - there are not a number of requests in a row that all fail - and two of the external servers account for the majority of the failures.
My question is how can I further diagnose these exceptions to determine if the problem is on my end (the client) or on the other end (the servers)?
The volume of requests precludes putting an analyzer on the wire - it would be very difficult to capture these few exceptions. I have reset CONNECTIONS and THREADS in my machine.config and the basic code looks like:
Dim hRequest As HttpWebRequest
Dim responseTime As String
Dim objWatch As New Stopwatch
Try
' calculate time it takes to process transaction
objWatch.Start()
hRequest = System.Net.WebRequest.Create(url)
' set some defaults
hRequest.Timeout = 5000
hRequest.ReadWriteTimeout = 10000
hRequest.KeepAlive = False ' to prevent open HTTP connection leak
hRequest.SendChunked = False
hRequest.AllowAutoRedirect = True
hRequest.MaximumAutomaticRedirections = 3
hRequest.Accept = "text/xml"
hRequest.Proxy = Nothing 'do not waste time searching for a proxy
hRequest.ServicePoint.Expect100Continue = False
Dim feed As New XDocument()
' use *Using* to auto close connections
Using hResponse As HttpWebResponse = DirectCast(hRequest.GetResponse(), HttpWebResponse)
Using reader As XmlReader = XmlReader.Create(hResponse.GetResponseStream())
feed = XDocument.Load(reader)
reader.Close()
End Using
hResponse.Close()
End Using
objWatch.Stop()
' Work here with returned contents in "feed" document
Return XXX' some results here
Catch ex As Exception
objWatch.Stop()
hRequest.Abort()
Return Nothing
End Try
Any suggestions?
By default, HttpWebRequest limits you to 2 connections per HTTP/1.1 server. So, if your requests take time to complete, and you have incoming requests queuing up on the server, you will run out of connection and thus get timeouts.
You should change the max outgoing connections on ServicePointManager.
ServicePointManager.DefaultConnectionLimit = 20 // or some big value.
You said that you are doing 5 outgoing request for each incoming request to the ASP page. Is that 5 different servers, or the same server?
DO you wait for the previous request to complete, before issuing the next one? Is the timeout happening while it is waiting for a connection, or during the request/response?
If the timeout is happening during the request/response then it means that the target server is under stress. The only way to find out if this is the case, is to run wireshark/netmon on one of the machines, and look at the network trace to see if the request from the app is even making it through to the server, and if it is, whether the target server is responding within the given timeout.
If this is a thread starvation issue, then one of the ways to diagnose it is to attach windbg.exe debugger to w3wp.exe process, when you start getting timeout. Then load the sos.dll debugging extension. And run the !threads command, followed by !threadpool command. It will show you how many Worker threads and completion port threads are utilized/remaining. If the #completionport threads or worker threads are low, then that will contribute to the timeout.
Alternatively, you can monitor ASP.NET and System.net perf counters. See if the ASP.NET request queue is increasing monotonically - this might indicate that your outgoing requests are not completing fast enough.
Sorry, there are no easy answers here. THere is a lot of avenues you will need to explore. If I were you, I would start off by attaching windbg.exe to w3wp when you start getting timeouts and do what I described earlier.