Asterisk: "TLS clean shutdown alert reading data" after 120s in SIP call - ssl

I am using a Secure SIP trunk provided by Twilio to implement an IVR. I have implemented per Twilio's Asterisk configuration guide, installed SRTP to /usr/local/lib, as well as implemented the configuration in https://wiki.asterisk.org/wiki/display/AST/Secure+Calling+Tutorial.
The problem is that any call longer than 2 minutes cannot be ended cleanly, and hanging it up causes Asterisk to restart.
sip.conf (using chan_sip, not pjsip):
[general]
; other configuration lines removed
tlsenable=yes
tlsbindaddr=0.0.0.0
tlscertfile=/etc/pki/tls/private/pbx.pem
tlscafile=/etc/pki/tls/private/gd_bundle-g2-g1.crt
tlscipher=ALL
tlsclientmethod=tlsv1
tlsdontverifyserver=yes
[twilio-trunk](!)
type=peer
context=from-twilio ;Which dialplan to use for incoming calls
dtmfmode=rfc4733
canreinvite=no
insecure=port,invite
transport=tls
qualify=yes
encryption=yes
media_encryption=sdes
I can make and receive calls just fine, and I have confirmed the calls are encrypted both via wireshark and confirmation from Twilio's own support queue.
At exactly 120 seconds into every call, this debug pops up:
[Dec 6 13:14:39] DEBUG[30015]: iostream.c:157 iostream_read: TLS clean shutdown alert reading data
[Dec 6 13:14:39] DEBUG[30015]: chan_sip.c:2905 sip_tcptls_read: SIP TCP/TLS server has shut down
The call continues to flow bi-directionally, the caller never knows there is a problem until they hit a hangup in context, i.e. h,1,Hangup(). Then Asterisk is restarted (new PID) and the caller hangs in limbo for another 5 minutes before the call times out with a fast busy. Twilio confirms they see the BYE and return an ACK at the point of the Hangup.
I was on 13.11 and updated to 15.1.3, same result. Calls longer than 120s result in TLS message in debug and Asterisk restarts.
No Google query results out there. Twilio hasn't been real helpful. Can anyone shed some light on what is happening and where I need to look next?
More logs:
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: channel.c:5551 set_format: Channel SIP/twilio0-00000000 setting write format path: gsm -> ulaw
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4017 rtp_raw_write: Difference is 2472, ms is 329
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (50 requested / 50 actual) timer ticks per second
-- <SIP/twilio0-00000000> Playing 'IVR/omnicare_9d_account.gsm' (language 'en')
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4928 ast_rtcp_interpret: Got RTCP report of 64 bytes from 34.203.250.7:10475
[Dec 8 10:18:53] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4928 ast_rtcp_interpret: Got RTCP report of 64 bytes from 34.203.250.7:10475
[Dec 8 10:18:55] DEBUG[4992]: iostream.c:157 iostream_read: TLS clean shutdown alert reading data
[Dec 8 10:18:55] DEBUG[4992]: chan_sip.c:2905 sip_tcptls_read: SIP TCP/TLS server has shut down
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:5551 set_format: Channel SIP/twilio0-00000000 setting write format path: ulaw -> ulaw
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4928 ast_rtcp_interpret: Got RTCP report of 64 bytes from 34.203.250.7:10475
[Dec 8 10:19:01] DEBUG[4914]: cdr.c:4305 ast_cdr_engine_term: CDR Engine termination request received; waiting on messages...
Asterisk uncleanly ending (0).
Executing last minute cleanups
== Destroying musiconhold processes
[Dec 8 10:19:01] DEBUG[4914]: res_musiconhold.c:1627 moh_class_destructor: Destroying MOH class 'default'
[Dec 8 10:19:01] DEBUG[4914]: cdr.c:1289 cdr_object_finalize: Finalized CDR for SIP/twilio0-00000000 - start 1512749813.880448 answer 1512749813.881198 end 1512749941.201797 dispo ANSWERED
== Manager unregistered action DBGet
== Manager unregistered action DBPut
== Manager unregistered action DBDel
== Manager unregistered action DBDelTree
[Dec 8 10:19:01] DEBUG[4914]: asterisk.c:2157 really_quit: Asterisk ending (0).

Check your firewall logs. We've had issues with sessions being torn down by firewalls that thought the NAT entries were stale/old.
You can also try configuring Asterisk to send keep-alive packets, either with the options qualify=yes and nat=yes in your sip.conf entry for that user/trunk, or inside the RTP stream with rtpkeepalive=<secs>. The best docs I could find for sip.conf are the example config on GitHub.
I dug in the source code for the text "TLS clean shutdown alert reading data", which pointed me to some OpenSSL docs which suggest a clean/normal closure (which I'm guessing was caused by your firewall):
The TLS/SSL connection has been closed. If the protocol version is SSL 3.0 or higher, this result code is returned only if a closure alert has occurred in the protocol, i.e. if the connection has been closed cleanly. Note that in this case SSL_ERROR_ZERO_RETURN does not necessarily indicate that the underlying transport has been closed.
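As a rough cross-check of the "exactly 120 seconds" observation, the second log excerpt in the question carries enough information to compute when the TLS shutdown was logged relative to the call start. This is a minimal sketch that assumes the server's clock is a whole number of hours off UTC, that both events fall within the same clock hour, and that the log timestamps are truncated to the second; the variable names are mine, the values are copied from the log.

```python
# Values copied from the second log excerpt in the question.
CDR_START = 1512749813.880448        # "start" in the "Finalized CDR" line (epoch seconds)
CDR_END = 1512749941.201797          # "end" in the same line
TLS_SHUTDOWN_LOG_TIME = "10:18:55"   # "TLS clean shutdown alert reading data"


def seconds_into_hour(hhmmss):
    """Convert an HH:MM:SS log timestamp to seconds past the hour."""
    _hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return minutes * 60 + seconds


# Total call length according to the CDR.
call_duration = CDR_END - CDR_START

# Seconds between the CDR start and the TLS shutdown log line.
# Assumption: the timezone offset is a whole number of hours, so the
# epoch value modulo 3600 matches the minutes and seconds in the log.
shutdown_offset = seconds_into_hour(TLS_SHUTDOWN_LOG_TIME) - CDR_START % 3600

print(f"call duration: {call_duration:.1f} s")
print(f"TLS shutdown logged about {shutdown_offset:.0f} s after the call started")
```

With these values the shutdown lands at roughly 121 seconds into the call, which is consistent with a fixed two-minute idle timer somewhere on the signalling path rather than anything tied to the media stream.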

Why is this monit config reminder syntax not resulting in repeated alerts?

I have set up a monit config to check that a Jenkins build node is connected (i.e. its VPN connection is still up) by checking for its VPN IP address from a server that is already inside the network. It seems to work at least once when the computer is not connected, but it only triggers once in a blue moon and not repeatedly like I want it to.
check host JenkinsMacOSXNode with address 192.168.237.10
    if failed icmp type echo
        count 5 with timeout 5 seconds
        2 times within 3 cycles
    then alert with reminder on 3 cycles
    alert admin@ourdomain.com
Is the above syntax correct for having an alert sent repeatedly when an expected computer is not pingable?
In case the next question is how often the cycle runs: /etc/monit/monitrc has set daemon 120, so each cycle should be 2 minutes.
Is there a better way to accomplish checking for a computer that should be connected via VPN to the network and alert if it is not?
Try setting the alert with reminder definition before the test:
set alert admin@ourdomain.com with reminder on 3 cycles
check host JenkinsMacOSXNode with address 192.168.237.10
    if failed icmp type echo
        count 5 with timeout 5 seconds
        2 times within 3 cycles
    then alert
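To get a feel for how often mail should arrive with these numbers, here is a small timing model. It is only a sketch under stated assumptions: the host fails every check from the first cycle on, the first alert goes out once the "2 times within 3 cycles" condition is met, and a reminder follows every 3 cycles while the failure persists. It does not query monit; the function names are made up for illustration.

```python
CYCLE_SECONDS = 120        # "set daemon 120" from the question
FAILED_CYCLES_NEEDED = 2   # "2 times within 3 cycles"
REMINDER_CYCLES = 3        # "with reminder on 3 cycles"


def alert_cycles(total_cycles):
    """Cycle numbers (1-based) at which a mail is expected in this model,
    assuming the host is unreachable from cycle 1 onward."""
    return list(range(FAILED_CYCLES_NEEDED, total_cycles + 1, REMINDER_CYCLES))


def seconds_after_first_failure(cycle):
    """Elapsed time between the first failed check (cycle 1) and the given cycle."""
    return (cycle - 1) * CYCLE_SECONDS


for cycle in alert_cycles(10):
    minutes = seconds_after_first_failure(cycle) // 60
    print(f"cycle {cycle}: mail expected about {minutes} min after the first failed check")
```

If the observed mails are much rarer than this model predicts, that points at the alert/reminder configuration rather than at the ping test itself.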

Setting a timeout on webservice consumer built with org.apache.axis.client.Call and running on Domino

I'm maintaining an antediluvian Notes application which connects to a SAP back-end via a hand-built 'Webservice'.
The server is running Domino Release 7.0.4FP2 HF97.
The Webservice is not the more recent Web Service Consumer, but a large Java agent which uses the Apache soap.jar (org.apache.soap). Below is an example of the calling code.
private Call setupSOAPCall() {
    Call call = new Call();
    SOAPHTTPConnection conn = new SOAPHTTPConnection();
    call.setSOAPTransport(conn);
    call.setEncodingStyleURI(Constants.NS_URI_SOAP_ENC);
    // ... remainder of the setup is not shown in the original excerpt
    return call;
}
There has been a change in the SAP system which is now taking 8 minutes to complete (verified by SAP Team).
I'm getting an error message as follows:
[SOAPException: faultCode=SOAP-ENV:Client; msg=For input string: "906 "; targetException=java.lang.NumberFormatException: For input string: "906 "]
I found a blog article describing the error message quite closely:
https://thejavablog.wordpress.com/category/jmeter/
and I've come to the hypothesis that it is a timeout message that is being returned to my Call object, and that this timeout message is being incorrectly parsed, hence the NumberFormatException.
Looking at my logs I can see that there is a time difference of 62 seconds between my call and the response.
I recommended that the server setting in the server document, tab Internet Protocols/HTTP/Timeouts/Request timeouts be changed from 60 seconds to 600 seconds, and the http task restarted with
tell http restart
I've re-run the tests and I am getting the same error, and the time difference is still slightly more than 60 seconds, which is not what I was expecting.
I read Michael Ruhnau's blog entry
http://www.mruhnau.net/2014/06/how-to-overcome-domino-webservice.html
which points to this APAR
http://www-01.ibm.com/support/docview.wss?uid=swg1LO48272
but I'm not convinced that this would apply in this case, since there is no way that IBM would know that my Java agent is in fact making a Soap call.
My current hypothesis is that I have to use either the setTimeout() method on
org.apache.axis.client.Call
https://axis.apache.org/axis/java/apiDocs/org/apache/axis/client/Call.html
or on the org.apache.soap.transport.http.SOAPHTTPConnection
https://docs.oracle.com/cd/B13789_01/appdev.101/b12024/org/apache/soap/transport/http/SOAPHTTPConnection.html
and that the timeout value is an apache default, not something that is controlled by the Domino server.
I'd be grateful for any help.
I understand your approach, and I hope it is the right one to solve your problem.
Add a debug statement (a console write would be fine) that displays the default timeout, then try increasing it to 10 minutes.
SOAPHTTPConnection conn = new SOAPHTTPConnection();
System.out.println("time out is :" + conn.getTimeout());
conn.setTimeout(600000);//10 min in ms
System.out.println("after setting it, time out is :" + conn.getTimeout());
call.setSOAPTransport(conn);
Now keep in mind that Domino also has a Max LotusScript/Java execution time; check this value and (at least as a test) change it: http://www.ibm.com/support/knowledgecenter/SSKTMJ_9.0.1/admin/othr_servertasksagentmanagertab_r.html (it's the version 9 help, but this part should be identical)
I've since discovered that it wasn't my code generating the error; the default timeout for the Apache SOAPHTTPConnection (org.apache.soap) is 0, i.e. no timeout.
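The point that a timeout of 0 (no timeout) cannot be what cuts a call off at about 60 seconds can be illustrated with plain sockets. The sketch below is a Python analogue, not the Apache SOAP API: a hypothetical local server answers after a delay, and the same client call is made once with a short read timeout and once with no timeout. All names are illustrative.

```python
import socket
import threading
import time


def start_slow_server(delay_seconds):
    """Start a local TCP server that answers each client after delay_seconds.
    Returns the port it listens on."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)

    def serve():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return
            with conn:
                time.sleep(delay_seconds)
                try:
                    conn.sendall(b"done")
                except OSError:
                    pass  # the client already gave up

    threading.Thread(target=serve, daemon=True).start()
    return listener.getsockname()[1]


def call_with_timeout(port, timeout):
    """Return the reply, or 'timed out' if nothing arrives within timeout.
    timeout=None waits indefinitely, like a timeout value of 0 in the answer above."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(timeout)
        try:
            return sock.recv(16).decode()
        except socket.timeout:
            return "timed out"


port = start_slow_server(0.3)
short = call_with_timeout(port, 0.1)       # client-side limit shorter than the server delay
unlimited = call_with_timeout(port, None)  # no client-side limit
print(short, unlimited)
```

With no client-side limit the call simply waits for the slow reply, so a cut-off at a fixed interval has to come from something else on the path (a server, proxy, or agent-manager setting).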

ServerXmlHttpRequest hanging sometimes when doing a POST

I have a job that periodically does some work involving ServerXmlHttpRequest to perform an HTTP POST. The job runs every 60 seconds.
And normally it runs without issue. But there's about a 1 in 50,000 chance (every two or three months) that it will hang:
IXMLHttpRequest http = new ServerXmlHttpRequest();
http.open("POST", deleteUrl, false, "", "");
http.send(stuffToDelete); <---hang
When it hangs, not even the Task Scheduler (with the option enabled to kill the job if it takes longer than 3 minutes to run) can end the task. I have to connect to the remote customer's network, get on the server, and use Task Manager to kill the process.
And then it's good for another month or three.
Eventually i started using Task Manager to create a process dump, so i could analyze where the hang is. After five crash dumps (over the last 11 months or so) i get a consistent picture:
ntdll.dll!_NtWaitForMultipleObjects@20()
KERNELBASE.dll!_WaitForMultipleObjectsEx@20()
user32.dll!MsgWaitForMultipleObjectsEx()
user32.dll!_MsgWaitForMultipleObjects@20()
urlmon.dll!CTransaction::CompleteOperation(int fNested) Line 2496
urlmon.dll!CTransaction::StartEx(IUri * pIUri, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4453 C++
urlmon.dll!CTransaction::Start(const wchar_t * pwzURL, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4515 C++
msxml3.dll!URLMONRequest::send()
msxml3.dll!XMLHttp::send()
Contoso.exe!FrobImporter.TFrobImporter.DeleteFrobs Line 971
Contoso.exe!FrobImporter.TFrobImporter.ImportCore Line 1583
Contoso.exe!FrobImporter.TFrobImporter.RunImport Line 1070
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.HandleFrobImport Line 433
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.CoreExecute Line 71
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.Execute Line 84
Contoso.exe!Contoso.Contoso Line 167
kernel32.dll!@BaseThreadInitThunk@12()
ntdll.dll!__RtlUserThreadStart()
ntdll.dll!__RtlUserThreadStart@8()
So i do a ServerXmlHttpRequest.send, and it never returns. It will sit there for days (causing the system to miss financial transactions, until come Sunday night i get a call that it's broken).
It is of no help unless someone knows how to debug code, but the registers in the stalled thread at the time of the dump are:
EAX 00000030
EBX 00000000
ECX 00000000
EDX 00000000
ESI 002CAC08
EDI 00000001
EIP 732A08A7
ESP 0018F684
EBP 0018F6C8
EFL 00000000
Windows Server 2012 R2
Microsoft IIS/8.5
Default timeouts of ServerXmlHttpRequest
You can use serverXmlHttpRequest.setTimeouts(...) to configure the four classes of timeouts:
resolveTimeout: The value is applied to mapping host names (such as "www.microsoft.com") to IP addresses; the default value is infinite, meaning no timeout.
connectTimeout: A long integer. The value is applied to establishing a communication socket with the target server, with a default timeout value of 60 seconds.
sendTimeout: The value applies to sending an individual packet of request data (if any) on the communication socket to the target server. A large request sent to a server will normally be broken up into multiple packets; the send timeout applies to sending each packet individually. The default value is 30 seconds.
receiveTimeout: The value applies to receiving a packet of response data from the target server. Large responses will be broken up into multiple packets; the receive timeout applies to fetching each packet of data off the socket. The default value is 30 seconds.
The KB305053 (a server that decides to keep the connection open will cause serverXmlHttpRequest to wait for the connection to close) seems like it plausibly could be the issue. But the 30 second default timeout would have taken care of that.
Possible workaround - Add myself to a Job
The Windows Task Scheduler is unable to terminate the task, even though the option to do so is enabled.
I will look into using the Windows Job API to add my own process to a job, and use SetInformationJobObject to set a time limit on my process:
CreateJobObject
AssignProcessToJobObject
SetInformationJobObject
to limit my process to three minutes of execution time:
PerProcessUserTimeLimit
If LimitFlags specifies JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process user-mode execution time limit, in 100-nanosecond ticks. Otherwise, this member is ignored.
The system periodically checks to determine whether each process associated with the job has accumulated more user-mode time than the set limit. If it has, the process is terminated.
If the job is nested, the effective limit is the most restrictive limit in the job chain.
Although since Task Scheduler uses Job objects to also limit a task's time, i'm not hopeful that the Job Object can limit a job either.
Edit: Job objects cannot limit a process by process time - only user time. And with a process idle waiting for an object, it will not accumulate any user time - certainly not three minutes worth.
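Since a Job object limit only counts user-mode CPU time, one alternative is a separate watchdog process that enforces a wall-clock limit. The following is a minimal sketch of that idea, not something from the original setup: the child command is a hypothetical stand-in for the hung job (an idle process that uses no CPU), and whether a kill issued this way succeeds where Task Scheduler's did not is an assumption that would need testing on the affected server.

```python
import subprocess
import sys
import time


def run_with_wall_clock_limit(argv, limit_seconds):
    """Run argv and kill it if it is still running after limit_seconds
    of wall-clock time, regardless of how little CPU it has used."""
    proc = subprocess.Popen(argv)
    try:
        proc.wait(timeout=limit_seconds)
        return "finished"
    except subprocess.TimeoutExpired:
        proc.kill()   # on Windows this calls TerminateProcess
        proc.wait()   # reap the child so no handle is left behind
        return "killed"


# Hypothetical stand-in for the hung job: a child that only sleeps.
idle_child = [sys.executable, "-c", "import time; time.sleep(30)"]

started = time.monotonic()
outcome = run_with_wall_clock_limit(idle_child, limit_seconds=1.0)
elapsed = time.monotonic() - started
print(f"{outcome} after {elapsed:.1f} s")
```

For the real job the limit would be the three minutes mentioned above, and the scheduled task would launch the watchdog instead of the job itself.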
Bonus Reading
How can a ServerXMLHTTP GET request hang? (GET, not POST)
KB305053: ServerXMLHTTP Stops Responding When You Send a POST Request (which says the timeout should expire; where mine does not)
MS Forums: oHttp.Send - Hangs (HEAD, not POST)
MS Forums: ASP to test SOAP WebService using MSXML2.ServerXMLHTTP Send hangs
CC to MS Support Forums
Consider switching to a newer, supported API.
msxml6.dll using MSXML2.ServerXMLHTTP.6.0
winhttpcom.dll using WinHttp.WinHttpRequest.5.1.
The msxml3.dll library is no longer supported and is only kept around for compatibility reasons. Plus, there were a number of security and stability improvements included with msxml4.dll (and newer) that you are missing out on.

How to Configure the Web Connector from metrics.log Values

I am reviewing the ColdFusion Web Connector settings in workers.properties to hopefully address a sporadic response time issue.
I've been advised to inspect the output from the metrics.log file (CF Admin > Debugging & Logging > Debug Output Settings > Enable Metric Logging) and use this to inform the adjustments to the settings max_reuse_connections, connection_pool_size and connection_pool_timeout.
My question is: How do I interpret the metrics.log output to inform the choice of setting values? Is there any documentation that can guide me?
Examples from over a 120 hour period:
95% of entries -
"Information","scheduler-2","06/16/14","08:09:04",,"Max threads: 150 Current thread count: 4 Current thread busy: 0 Max processing time: 83425 Request count: 9072 Error count: 72 Bytes received: 1649 Bytes sent: 22768583 Free memory: 124252584 Total memory: 1055326208 Active Sessions: 1396"
Occurred once -
"Information","scheduler-2","06/13/14","14:20:22",,"Max threads: 150 Current thread count: 10 Current thread busy: 5 Max processing time: 2338 Request count: 21 Error count: 4 Bytes received: 155 Bytes sent: 139798 Free memory: 114920208 Total memory: 1053097984 Active Sessions: 6899"
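For trending these values over time, the entries can be turned into numbers fairly easily. The sketch below parses the first sample entry above into a dictionary; it assumes every metrics.log line follows the same quoted, comma-separated layout with the counters in the last field. The helper name and the two derived values are my own, and it does not by itself say which connector setting to change.

```python
import csv
import re

# First sample entry from above, copied verbatim.
LINE = (
    '"Information","scheduler-2","06/16/14","08:09:04",,'
    '"Max threads: 150 Current thread count: 4 Current thread busy: 0 '
    'Max processing time: 83425 Request count: 9072 Error count: 72 '
    'Bytes received: 1649 Bytes sent: 22768583 Free memory: 124252584 '
    'Total memory: 1055326208 Active Sessions: 1396"'
)


def parse_metrics(line):
    """Return the 'name: number' pairs from the last field of a metrics.log line."""
    message = next(csv.reader([line]))[-1]
    pairs = re.findall(r"([A-Za-z][A-Za-z ]*?):\s*(\d+)", message)
    return {name: int(value) for name, value in pairs}


metrics = parse_metrics(LINE)

# Two derived values that are easy to chart across many entries.
busy_ratio = metrics["Current thread busy"] / metrics["Max threads"]
used_memory = metrics["Total memory"] - metrics["Free memory"]

print(metrics)
print(f"busy threads: {busy_ratio:.0%} of max, memory in use: {used_memory} bytes")
```

Running this over the whole file and plotting busy threads against time would show whether the thread counts ever move during the slow periods.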
Environment:
3 x Windows 2008 R2 (hardware load balanced)
ColdFusion 10 (update 12)
Apache 2.2.21
Richard, I realize your question here is from 2014, and perhaps you have since resolved it, but I suspect your problem was that the port set in the CF admin (below the "metrics log" checkbox) was set to 8500, which is your internal web server (used by the CF admin only, typically, if at all). That's why the numbers are not changing. (And for those who don't enable the internal web server at installation of CF, or later, most values in the metrics log are null).
I address this problem in a blog post I happened to do just last week: http://www.carehart.org/blog/client/index.cfm/2016/3/2/cf_metrics_log_part1
Hope any of this helps.

Mono error when load testing

During load testing (using Load UI) of a new .Net web api using Mono hosted on a medium sized Amazon server I'm receiving the following results (in chronological order over the course of about ten minutes)
5 connections per second for 60 seconds
No errors
50 connections per second for 60 seconds
No errors
100 connections per second for 60 seconds
Received 3 errors, appearing later during the run
2014-02-07 00:12:10Z Error HttpResponseExtensions Error occured while Processing Request: [IOException] Write failure Write failure|The socket has been shut down
2014-02-07 00:12:10Z Info HttpResponseExtensions Failed to write error to response: {0} Cannot be changed after headers are sent.
5 connections per second for 60 seconds
No errors
100 connections per second for 30 seconds
No errors
100 connections per second for 60 seconds
Received 1 error same as above, appearing later during the run
100 connections per second for 45 seconds
No errors
From some research, this error seems to be the standard one raised when a client closes the connection. As it only occurs during the heavier load tests, I am wondering whether the server instance is simply reaching the upper limit of what it can support. If not, any suggestions on hunting down the source of the errors?
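For scale, the error counts above can be put against the number of connections in each run. This is a small sketch that assumes the load tool held the stated rate for the full duration and that each connection is one request; the run data is copied from the list above.

```python
# (connections per second, duration in seconds, errors observed), in the order reported
RUNS = [
    (5, 60, 0),
    (50, 60, 0),
    (100, 60, 3),
    (5, 60, 0),
    (100, 30, 0),
    (100, 60, 1),
    (100, 45, 0),
]


def error_rate(per_second, seconds, errors):
    """Fraction of connections that failed, assuming a steady connection rate."""
    return errors / (per_second * seconds)


for per_second, seconds, errors in RUNS:
    total = per_second * seconds
    rate = error_rate(per_second, seconds, errors)
    print(f"{per_second:>3}/s for {seconds:>2} s: {total:>4} connections, "
          f"{errors} errors ({rate:.3%})")
```

Under those assumptions, errors show up only in the two runs that reach 6,000 connections, and then only late in the run, which fits something accumulating over the course of a run as much as a hard throughput ceiling.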