How to diagnose "the operation has timed out" HttpException - vb.net

I am calling 5 external servers to retrieve XML-based data for each request for a particular webpage on my IIS 6 server. Present volume is between 3-5 incoming requests per second, meaning 15-20 outgoing requests per second.
99% of the outgoing requests from my server (the client) to the external servers (the server) work OK but about 100-200 per day end up with a "The operation has timed out" exception.
This suggests I have a resource problem on my server - some shortage of sockets, ports etc or a thread lock but the problem with this theory is that the failures are entirely random - there are not a number of requests in a row that all fail - and two of the external servers account for the majority of the failures.
My question is how can I further diagnose these exceptions to determine if the problem is on my end (the client) or on the other end (the servers)?
The volume of requests precludes putting an analyzer on the wire - it would be very difficult to capture these few exceptions. I have reset CONNECTIONS and THREADS in my machine.config and the basic code looks like:
Dim hRequest As HttpWebRequest
Dim responseTime As String
Dim objWatch As New Stopwatch
Try
' calculate time it takes to process transaction
objWatch.Start()
hRequest = System.Net.WebRequest.Create(url)
' set some defaults
hRequest.Timeout = 5000
hRequest.ReadWriteTimeout = 10000
hRequest.KeepAlive = False ' to prevent open HTTP connection leak
hRequest.SendChunked = False
hRequest.AllowAutoRedirect = True
hRequest.MaximumAutomaticRedirections = 3
hRequest.Accept = "text/xml"
hRequest.Proxy = Nothing 'do not waste time searching for a proxy
hRequest.ServicePoint.Expect100Continue = False
Dim feed As New XDocument()
' use *Using* to auto close connections
Using hResponse As HttpWebResponse = DirectCast(hRequest.GetResponse(), HttpWebResponse)
Using reader As XmlReader = XmlReader.Create(hResponse.GetResponseStream())
feed = XDocument.Load(reader)
reader.Close()
End Using
hResponse.Close()
End Using
objWatch.Stop()
' Work here with returned contents in "feed" document
Return XXX' some results here
Catch ex As Exception
objWatch.Stop()
hRequest.Abort()
Return Nothing
End Try
Any suggestions?

By default, HttpWebRequest limits you to 2 connections per HTTP/1.1 server. So, if your requests take time to complete, and you have incoming requests queuing up on the server, you will run out of connection and thus get timeouts.
You should change the max outgoing connections on ServicePointManager.
ServicePointManager.DefaultConnectionLimit = 20 // or some big value.

You said that you are doing 5 outgoing request for each incoming request to the ASP page. Is that 5 different servers, or the same server?
DO you wait for the previous request to complete, before issuing the next one? Is the timeout happening while it is waiting for a connection, or during the request/response?
If the timeout is happening during the request/response then it means that the target server is under stress. The only way to find out if this is the case, is to run wireshark/netmon on one of the machines, and look at the network trace to see if the request from the app is even making it through to the server, and if it is, whether the target server is responding within the given timeout.
If this is a thread starvation issue, then one of the ways to diagnose it is to attach windbg.exe debugger to w3wp.exe process, when you start getting timeout. Then load the sos.dll debugging extension. And run the !threads command, followed by !threadpool command. It will show you how many Worker threads and completion port threads are utilized/remaining. If the #completionport threads or worker threads are low, then that will contribute to the timeout.
Alternatively, you can monitor ASP.NET and System.net perf counters. See if the ASP.NET request queue is increasing monotonically - this might indicate that your outgoing requests are not completing fast enough.
Sorry, there are no easy answers here. THere is a lot of avenues you will need to explore. If I were you, I would start off by attaching windbg.exe to w3wp when you start getting timeouts and do what I described earlier.

Related

Setting a timeout on webservice consumer built with org.apache.axis.client.Call and running on Domino

I'm maintaining an antedeluvian Notes application which connects to a SAP back-end via a manually done 'Webservice'
The server is running Domino Release 7.0.4FP2 HF97.
The Webservice is not the more recently Webservice Consumer, but a large Java agent which is using Apache soap.jar (org.apache.soap). Below an example of the calling code.
private Call setupSOAPCall() {
Call call = new Call();
SOAPHTTPConnection conn = new SOAPHTTPConnection();
call.setSOAPTransport(conn);
call.setEncodingStyleURI(Constants.NS_URI_SOAP_ENC);
There has been a change in the SAP system which is now taking 8 minutes to complete (verified by SAP Team).
I'm getting an error message as follows:
[SOAPException: faultCode=SOAP-ENV:Client; msg=For input string: "906 "; targetException=java.lang.NumberFormatException: For input string: "906 "]
I found a blog article describing the error message quite closely:
https://thejavablog.wordpress.com/category/jmeter/
and I've come to the hypothesis that it is a timeout message that is returning to my Call object and that this timeout message is being incorrectly parsed, hence the NumberFormat Exception.
Looking at my logs I can see that there is a time difference of 62 seconds between my call and the response.
I recommended that the server setting in the server document, tab Internet Protocols/HTTP/Timeouts/Request timeouts be changed from 60 seconds to 600 seconds, and the http task restarted with
tell http restart
I've re-run the tests and I am getting the same error, and the time difference is still slightly more than 60 seconds, which is not what I was expecting.
I read Michael Rulnau's blog entry
http://www.mruhnau.net/2014/06/how-to-overcome-domino-webservice.html
which points to this APR
http://www-01.ibm.com/support/docview.wss?uid=swg1LO48272
but I'm not convinced that this would apply in this case, since there is no way that IBM would know that my Java agent is in fact making a Soap call.
My current hypothesis is that I have to use either the setTimeout() method on
org.apache.axis.client.Call
https://axis.apache.org/axis/java/apiDocs/org/apache/axis/client/Call.html
or on the org.apache.soap.transport.http.SOAPHTTPConnection
https://docs.oracle.com/cd/B13789_01/appdev.101/b12024/org/apache/soap/transport/http/SOAPHTTPConnection.html
and that the timeout value is an apache default, not something that is controlled by the Domino server.
I'd be grateful for any help.
I understand your approach, and I hope this is the correct one to solve your problem.
Add a debug (console write would be fine) that display the default Timeout then try to increase it to 10 min.
SOAPHTTPConnection conn = new SOAPHTTPConnection();
System.out.println("time out is :" + conn.getTimeout());
conn.setTimeout(600000);//10 min in ms
System.out.println("after setting it, time out is :" + conn.getTimeout());
call.setSOAPTransport(conn);
Now keep in mind that Dommino has also a Max LotusScript/Java execution time, check this value and (at least for a try) change it: http://www.ibm.com/support/knowledgecenter/SSKTMJ_9.0.1/admin/othr_servertasksagentmanagertab_r.html (it's version 9 help but this part should be identical)
I've since discovered that it wasn't my code generating the error; the default timeout for the apache axis SOAPHTTPConnetion is 0, i.e. no timeout.

ServerXmlHttpRequest hanging sometimes when doing a POST

I have a job that periodically does some work involving ServerXmlHttpRquest to perform an HTTP POST. The job runs every 60 seconds.
And normally it runs without issue. But there's about a 1 in 50,000 chance (every two or three months) that it will hang:
IXMLHttpRequest http = new ServerXmlHttpRequest();
http.open("POST", deleteUrl, false, "", "");
http.send(stuffToDelete); <---hang
When it hangs, not even the Task Scheduler (with the option enabled to kill the job if it takes longer than 3 minutes to run) can end the task. I have to connect to the remote customer's network, get on the server, and use Task Manager to kill the process.
And then its good for another month or three.
Eventually i started using Task Manager to create a process dump,
so i could analyze where the hang is. After five crash dumps (over the last 11 months or so) i get a consistent picture:
ntdll.dll!_NtWaitForMultipleObjects#20()
KERNELBASE.dll!_WaitForMultipleObjectsEx#20()
user32.dll!MsgWaitForMultipleObjectsEx()
user32.dll!_MsgWaitForMultipleObjects#20()
urlmon.dll!CTransaction::CompleteOperation(int fNested) Line 2496
urlmon.dll!CTransaction::StartEx(IUri * pIUri, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4453 C++
urlmon.dll!CTransaction::Start(const wchar_t * pwzURL, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4515 C++
msxml3.dll!URLMONRequest::send()
msxml3.dll!XMLHttp::send()
Contoso.exe!FrobImporter.TFrobImporter.DeleteFrobs Line 971
Contoso.exe!FrobImporter.TFrobImporter.ImportCore Line 1583
Contoso.exe!FrobImporter.TFrobImporter.RunImport Line 1070
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.HandleFrobImport Line 433
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.CoreExecute Line 71
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.Execute Line 84
Contoso.exe!Contoso.Contoso Line 167
kernel32.dll!#BaseThreadInitThunk#12()
ntdll.dll!__RtlUserThreadStart()
ntdll.dll!__RtlUserThreadStart#8()
So i do a ServerXmlHttpRequest.send, and it never returns. It will sit there for days (causing the system to miss financial transactions, until come Sunday night i get a call that it's broken).
It is of no help unless someone knows how to debug code, but the registers in the stalled thread at the time of the dump are:
EAX 00000030
EBX 00000000
ECX 00000000
EDX 00000000
ESI 002CAC08
EDI 00000001
EIP 732A08A7
ESP 0018F684
EBP 0018F6C8
EFL 00000000
Windows Server 2012 R2
Microsoft IIS/8.5
Default timeouts of ServerXmlHttpRequest
You can use serverXmlHttpRequest.setTimeouts(...) to configure the four classes of timeouts:
resolveTimeout: The value is applied to mapping host names (such as "www.microsoft.com") to IP addresses; the default value is infinite, meaning no timeout.
connectTimeout: A long integer. The value is applied to establishing a communication socket with the target server, with a default timeout value of 60 seconds.
sendTimeout: The value applies to sending an individual packet of request data (if any) on the communication socket to the target server. A large request sent to a server will normally be broken up into multiple packets; the send timeout applies to sending each packet individually. The default value is 30 seconds.
receiveTimeout: The value applies to receiving a packet of response data from the target server. Large responses will be broken up into multiple packets; the receive timeout applies to fetching each packet of data off the socket. The default value is 30 seconds.
The KB305053 (a server that decides to keep the connection open will cause serverXmlHttpRequest to wait for the connection to close) seems like it plausibly could be the issue. But the 30 second default timeout would have taken care of that.
Possible workaround - Add myself to a Job
The Windows Task Scheduler is unable to terminate the task; even though the option is enabled to do do.
I will look into using the Windows Job API to add my self process to a job, and use SetInformationJobObject to set a time limit on my process:
CreateJobObject
AssignProcessToJobObject
SetInformationJobObject
to limit my process to three minutes of execution time:
PerProcessUserTimeLimit
If LimitFlags specifies
JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process
user-mode execution time limit, in 100-nanosecond ticks. Otherwise,
this member is ignored.
The system periodically checks to determine
whether each process associated with the job has accumulated more
user-mode time than the set limit. If it has, the process is
terminated.
If the job is nested, the effective limit is the most
restrictive limit in the job chain.
Although since Task Scheduler uses Job objects to also limit a task's time, i'm not hopeful that the Job Object can limit a job either.
Edit: Job objects cannot limit a process by process time - only user time. And with a process idle waiting for an object, it will not accumulate any user time - certainly not three minutes worth.
Bonus Reading
How can a ServerXMLHTTP GET request hang? (GET, not POST)
KB305053: ServerXMLHTTP Stops Responding When You Send a POST Request (which says the timeout should expire; where mine does not)
MS Forums: oHttp.Send - Hangs (HEAD, not POST)
MS Forums: ASP to test SOAP WebService using MSXML2.ServerXMLHTTP Send hangs
CC to MS Support Forums
Consider switching to a newer, supported API.
msxml6.dll using MSXML2.ServerXMLHTTP.6.0
winhttpcom.dll using WinHttp.WinHttpRequest.5.1.
The msxml3.dll library is no longer supported and is only kept around for compatibility reasons. Plus, there were a number of security and stability improvements included with msxml4.dll (and newer) that you are missing out on.

How to make pooling HTTP connection with twisted?

I wirite a very simple spider program to fetch webpages from single site.
Here is a minimized version.
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)
def onHeader(response, i):
deferred = readBody(response)
deferred.addCallback(onBody, i)
deferred.addErrback(errorHandler)
return response
def onBody(body, i):
print('Received %s, Length %s' % (i, len(body)))
def errorHandler(err):
print('%s : %s' % (reactor.seconds() - startTimeStamp, err))
def requestFactory():
for i in range (start, end):
deferred = agent.request('GET', baseUrl + str(i))
deferred.addCallback(onHeader, i)
deferred.addErrback(errorHandler)
print('Generated %s' % i)
reactor.iterate(1)
print('All requests has generated, elpased %s' % (reactor.seconds() - startTimeStamp))
startTimeStamp = reactor.seconds()
reactor.callWhenRunning(requestFactory)
reactor.run()
For a few requests, like 100, it works fine. But for massive requests, it will failed.
I expect all of the requests(around 3000) should be automatically pooled, scheduled and pipelined, since I use HTTPConnectionPool, set maxPersistentPerHost, create an Agent instance with it and incrementally create the connections.
But it doesn't, the connections are not keep-alive nor pooled.
In this programm, it did establish the connections incrementally, but the connections didn't pooled, each connecction will close after body received, and later requests never wait in the pool for an available connecction.
So it will take thousands of sockets, and finally failed due to timeout, because the remote server has a connection timeout set to 30s. Thousands of requests can't be done within 30s.
Could you please give me some help on this?
I have tried my best on this, here is my finds.
Error occured exactly 30s after reactor start runing, won't be influenced by other things.
Let the spider fetch my server, I find something interesting.
The HTTP protocol version is 1.1 (I check the twisted document, the default HTTPClient is 1.0 rather than 1.1)
If I didn't add any explicit header(just like the minimized version), the request header didn't contain Connection: Keep-Alive, either do response header.
If I add explicit header to ensure it is a keep-alive connection, the request header did contain Connection: Keep-Alive, but the response header still not. (I am sure my server behave correctly, other stuff like Chrome, wget did receive Connection: Keep-Alive header.)
I check /proc/net/sockstat during running, it increase rapidly at first, and decrease rapidly later. (I have increase the ulimit to support plenty of sockets)
I write a similar program with treq, a twisted based request library). The code is almost the same, so not paste here.
Link: https://gist.github.com/Preffer/dad9b1228fcd75cebd75
It's behavior is almost the same. Not pooling. It is expected to be pooling as described in treq's feature list.
If I have add explicit header on it, Connection: Keep-Alive never appear in response header.
Based on all of the above, I am quite suspicious about the quirk Connection: Keep-Alive header ruin the program. But this header is part of HTTP 1.1 standard, and it did report as HTTP 1.1. I am completely puzzled on this.
I solved the problem myself, with help from IRC and another question in stackoverflow, Queue remote calls to a Python Twisted perspective broker?
In summary, the agent's behavior is very different from that in Nodejs(I have some experience in Nodejs). As it described on Nodejs doc
agent.requests
An object which contains queues of requests that have not yet been assigned to sockets.
agent.maxSockets
By default set to 5. Determines how many concurrent sockets the agent can have open per origin. Origin is either a 'host:port' or 'host:port:localAddress' combination.
So, here is the difference.
Twisted:
There is no doubt that Agent could queue requests if construct with a HTTPConnectionPool instance.
But if a new request is issued after connections in pool has run out, the agent will still create a new connection and perform the request, rather than put it in a queue.
Actually, it will lead to drop a connection in the pool, and push the newly generated connection into the pool, keep the connections count still equal to maxPersistentPerHost
Nodejs:
By default, agent will queue the requests with a implicit connection pool, which have a size of 5 connections.
If a new request is issued after connections in pool has run out, the agent will queue the requests into agent.requests variable, waiting for available connection.
The agent's behavior in twisted lead to a result that the agent is able to queue the requests, but actually it doesn't.
Follow our intuition, once assign a connection pool to an agent, it is in line with the intuition that agent will only use the connections in the pool, and wait for available connection if the pool has run out. That is exactly match with the agent in Nodejs.
Personally, I think it is a buggy behavior in twisted, or at least, could make an improvement to provide an option to set agent's behavior.
According to this, I have to use DeferredSemaphore to manually schedule the requests.
I raise a issue to treq project on github, and get similar solution. https://github.com/dreid/treq/issues/71
Here is my solution.
#!/usr/bin/env python
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
from twisted.internet.defer import DeferredSemaphore
baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500
count = end - start
concurrency = 10
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = concurrency
agent = Agent(reactor, pool=pool)
sem = DeferredSemaphore(concurrency)
done = 0
def onHeader(response, i):
deferred = readBody(response)
deferred.addCallback(onBody, i)
deferred.addErrback(errorHandler, i)
return deferred
def onBody(body, i):
sem.release()
global done, count
done += 1
print('Received %s, Length %s, Done %s' % (i, len(body), done))
if(done == count):
print('All items fetched')
reactor.stop()
def errorHandler(err, i):
print('[%s] id %s: %s' % (reactor.seconds() - startTimeStamp, i, err))
def requestFactory(token, i):
deferred = agent.request('GET', baseUrl + str(i))
deferred.addCallback(onHeader, i)
deferred.addErrback(errorHandler, i)
print('Request send %s' % i)
#this function it self is a callback emit by reactor, so needn't iterate manually
#reactor.iterate(1)
return deferred
def assign():
for i in range (start, end):
sem.acquire().addCallback(requestFactory, i)
startTimeStamp = reactor.seconds()
reactor.callWhenRunning(assign)
reactor.run()
Is it right? Beg for pointing out my error and improvements.
For a few requests, like 100, it works fine. But for massive requests,
it will failed.
This is either a protection against web crawlers or a server protection against DoS/DDoS, because you are sending too much requests from the same IP in a short time, so the Firewall or the WSA will block your future request. Just modify your script to make request in batch spaced by some time. you can use callLater() with some time after each X request.

WebDAV & Exchange 2003 - works in debug, fails without

So I have a very weird problem.
I am writing some code in VB.Net under .NET 2.0 which interfaces with MS Exchange 2003. Because of the Exchange 2003 "requirement" I am forced to write this code using WEBDAV.
The code itself is replicating, to some degree, a schedule management process. It's creating Appointments on the Exchange Server in response to inputs from the user and managing it's data internally in a SQL Server database.
The problem situation is this: A new person is assigned to be in charge of a meeting. The requirement says the program should generate a meeting cancellation request to the person removed from the meeting (if such a person existed) and a meeting request sent to the new person.
In the case of there being an existing person, what appears to happen is this:
The meeting cancellation request
gets sent
Exchange barfs and returns status code 500 (internal server error) during the set of
requests which send the meeting request to the new person.
However! While debugging this particular scenario, it works just fine for me, if I step through the code in the Visual Studio debugger. Left to it's own devices, it fails every time.
Just for yuk's sake, I added a Thread.Sleep(500) to the part after sending the cancellation request, and Exchange doesn't barf anymore...
So, my question!
If adding a Thread.Sleep to the code causes this error to go away, a race condition is implied, no? But, my code is running under a web application and is a totally single threaded process, from start to finish. The web requests I am sending are all in synchronous mode so this shouldn't be a problem.
What would I do next to try and track down the issue?
Try and divine if the race condition itself is in the .Net 2.0 BCL networking code?
Try and do some debugging on the Exchange server itself?
Ignore it, be glad the Thread.Sleep masks the problem and keep on going?
Any further suggestions would be wonderful.
In response to comment, I can post the failing function:
Private Shared Sub UpdateMeeting(ByVal folder As String, ByVal meetingId As String, ByVal oldAssignedId As String, ByVal newAssignedTo As String, ByVal transaction As DbTransaction)
If String.IsNullOrEmpty(meetingId) Then
Throw New Exception("Outlook ID for that date and time is empty.")
End If
Dim x As New Collections.Generic.List(Of String)
If oldAssignedId <> newAssignedTo AndAlso Not String.IsNullOrEmpty(oldAssignedId) Then
'send cancellation to old person
Dim lGetCounselorEmail1 As String = GetCounselorEmail(oldAssignedId, transaction)
Common.Exchange.SendCancellation(meetingId, New String() {lGetCounselorEmail1})
' Something very weird here. Running this code through the debugger works fine. Running without causes exchange to return 500 - Internal Server Error.
Threading.Thread.Sleep(500)
End If
x.Add(folder)
If Not String.IsNullOrEmpty(newAssignedTo) Then x.Add(GetCounselorEmail(newAssignedTo, transaction))
x.RemoveAll(AddressOf String.IsNullOrEmpty)
If x.Count > 0 Then
If Not Common.Exchange.UpdateMeetingAttendees(meetingId, x.ToArray()) Then
Throw New Exception("Failure during update of calendar")
End If
End If
End Sub
...but a lot of the implementation details are hidden here, as I wrote a set of classes to interface with Exchange WebDAV.
Ended up sticking with the Sleep and calling it a day.
My 'belief' is that I was erroneous in thinking that a WebRequest/WebResponse combo sent to Exchange through WebDav was an atomic operation.

WCF ReliableSession and Timeouts

I have a WCF service used mainly for managing documents in a repository.
I used the chunking channel sample from MS so that I could upload/download huge files.
Now I implemented reliable session with the service and I am seeing some strange behaviors.
Here are the timeout values I am using.
this.SendTimeout = new TimeSpan(0,10,0);
this.OpenTimeout = new TimeSpan(0, 1, 0);
this.CloseTimeout = new TimeSpan(0, 1, 0);
this.ReceiveTimeout = new TimeSpan(0,10, 0);
reliableBe.InactivityTimeout = new TimeSpan(0,2,0);
I have the following issues:
1. If the Service is not up & running, the clients are not get disconnected after OpenTimeout.
I tried it with my test client.
Scenario 1: Without Reliable Session:
I get the following exception:
Could not connect to net.tcp://localhost:8788/MediaManagementService/ep1. The connection attempt lasted for a time span of 00:00:00.9848790. TCP error code 10061: No connection could be made because the target machine actively refused it 127.0.0.1:8788
This is the correct behavior as I have given the OpenTimeout as 1 sec.
Scenario 2: With ReliableSession:
I get the same exception:
Could not connect to net.tcp://localhost:8788/MediaManagementService/ep1. The connection attempt lasted for a time span of 00:00:00.9692460. TCP error code 10061: No connection could be made because the target machine actively refused it 127.0.0.1:8788.
But this message comes after around 10 mintes . (I believe after SendTimeout)
So here I just have enabled the reliable session and now it looks like the OpenTimeout = SendTimeout for the client.
Is this desired behavior?
2: Issue while uploading huge files with ReliableSession:
The general rule is that you have to set a huge value for the maxReceivedMessageSize, SendTimeout and ReceiveTimeout.
But in the case of Chunking channel, the max received message size doesn't matter as the data is sent in chunks.
So I set a huge value for Send and ReceiveTimeout : say 10 hours.
Now the upload is going fine, but it has a side effect that, even if the Service is not up, it takes 10 hours to timeout the client connection due to the behavior mentioned in (1).
Please let me know your thoughts on this behavior.