How to debug and fix intermittent SSL 'connection reset by peer' error? - ssl

We are having an occasional (1 in 100) error appear on our client (CentOS) when connecting to a server (Windows/IIS) over HTTPS.
The error is: SSL: Connection reset by peer.
Running openssl s_client -connect example.com:443 -prexit works 99% of the time but sometimes returns write:errno=104 confirming the connection reset issue.
Interestingly the handshake is a different (smaller) size when the connection is reset and fails but I cannot see how to actually see the handshake.
A successful connection is: SSL handshake has read 5308 bytes and written 319 bytes
A failed connection is: SSL handshake has read 5249 bytes and written 198 bytes
The same protocol (TLS) and cipher is used at all times.
Server side, the error in Windows Event log is: A fatal alert was generated and sent to the remote endpoint. This may result in termination of the connection. The TLS protocol defined fatal error code is 20. The Windows SChannel error state is 960.
Fatal error code 20 is Received a record with an incorrect MAC. This message is always fatal..
Can anyone help debug this further? As it's only an occasional issue I am struggling to think why it would happen. Thanks!

Not an application error, but most likely a low level error in the infrastructure. Not specific to SSL but to connection oriented sockets. Packet TTL expiring, network route changing or many others. Well written socket code will alway retry a few times before failing. This is very hard to debug becuase it is often not repeatable over short time periods.
Many years ago this error was making me crazy. Did everything I could to track it down, even wrote a monitor to walk the network graph of the system to make sure each node of the graph was functional and responding properly. About a year later the problem disappeared when a switch on the subnet was replaced. The switch was close to the application not to the nodes on the graph in the datacenter.

Related

TLS handshake fail, but communication is not closed

I have TLS program and I did some experiments on it.
I start confidential TLS server session and try to connect to it with pure Telnet client.
As expected, the handshake failed and the server is available to the next client but on the Telnet client side I didn't receive any indication that the handshake failed and that the server is accepting other clients.
I can see in Wireshark that even after the handshake failed the Telnet client can send strings; I see [PSH, ACK] from the client answered by [ACK] from the server.
Adding Wireshark snapshot, Telnet failed the handshake, Telnet keep sending messages, followed by success in the TLS handshake and more Telnet messages:
Why is the server ACKing the Telnet client if the handshake failed and he is accepting other clients?
As expected, the handshake failed ...
I cannot see a failed TLS handshake in the packet capture and I'm not sure how you come to this conclusion.
All I can see that the client on source port 60198 (presumable your telnet) is sending 3 bytes several times and the server just ACK'ing these without sending anything back and without closing the connection. Likely the server is still expecting data in the hope that at some time it will be a complete TLS record. Only then it will be processed by the TLS stack and then it might realize that something is wrong with the client.
... the server is available to the next client
It is pretty normal for a server to handle multiple clients in parallel. In contrary, it would be unusual if the server could not do this.

TCP windows full - intermittent issues in SSL handshake

I am seeing an intermittent SSL handshake error. Looking at the tcp packets, it seems that the times when the SSL handshake fails, TCP options are missing. I have attached the wireshark screenshot for the success and failure scenario. Notice the difference in the [SYN,ACK] packet sent by the server. In success case, it has larger Window size (4380) as compared to 512 when it fails. It also has additional options like MSS and SACK_PERM.
Would anyone know why this would happen? It's the same server but it's sending different capabilities in separate scenarios. Any info in troubleshooting this issue will help. Thanks!

nginx - log SSL handshake failures

I'm running an nginx server with SSL enabled.
My protocol / cipher settings are fairly secure, and I've checked them at ssllabs.com, but --
-- since this is a web service which is called by http clients that I have no control over, I have concerns about compatibility.
To the point:
Is there a way to log SSL handshake failures as they happen (if they happen) in my nginx logs?
For example, I've got SSLv3 disabled, and if I try to "curl -3" (forcing SSlv3) to my server, then I get this:
NSS error -12286 (SSL_ERROR_NO_CYPHER_OVERLAP)
Cannot communicate securely with peer: no common encryption algorithm(s).
Closing connection 0 curl: (35) Cannot communicate securely with peer: no common encryption algorithm(s).
I would like to log this type of error in server logs too, with the default nginx settings, there is nothing.
Enabling "debug" log level for the error log does what I want, will log SSL handshake errors -- but unfortunately it also logs too much other stuff, making the log too bloated, drowning out other potentially useful info.
You can use the info log level.

TLS sslv3 hankshake error

I use a Tcl script to pull from several API's and all of a sudden some API's stopped working. eg:
set data [http_call_get https://api.vineapp.com/timelines/popular?page=1&anchor=1]
responds with the error:
SSL Channel "sock624": error: sslv3 alert handshake failure
It's odd that two of the five API's from different sites stopped working within an hour or each other so I feel that something changed with the compatibility of the tls1.6.3.1 tcl package which binding with "::http::register https 443 ::tls::socket"
I've tried on three different machines (2 x Windows and a ubuntu box).
The sites you are attempting to connect to have probably disabled sslv3 due to the poodle vulnerability.
I would guess your tcl script needs to use TLS instead.

How to track down "Connection timout during SSL handshake" and "Connection closed during ssl handshake" errors

I have recently switched over to HAProxy from AWS ELB. I am terminating SSL at the load balancer (HAProxy 1.5dev19).
Since switching, I keep getting some SSL connection errors in the HAProxy log (5-10% of the total number of requests). There's three types of errors repeating:
Connection closed during SSL handshake
Timeout during SSL handshake
SSL handshake failure (this one happens rarely)
I'm using a free StartSSL certificate, so my first thought was that some hosts are having trouble accepting this certificate, and I didn't see these errors in the past because ELB offers no logging. The only issue is that some hosts have do have successful connections eventually.
I can connect to the servers without any errors, so I'm not sure how to replicate these errors on my end.
This sounds like clients who are going away mid-handshake (TCP RST or timeout). This would be normal at some rate, but 5-10% sounds too high. It's possible it's a certificate issue; I'm not certain exactly how that presents to
Things that occur to me:
If negotiation is very slow, you'll have more clients drop off.
You may have underlying TCP problems which you weren't aware of until your new SSL endpoint proxy started reporting them.
Do you see individual hosts that sometimes succeed and sometimes fail? If so, this is unlikely to be a certificate issue. I'm not sure how connections get torn down when a user rejects an untrusted certificate.
You can use Wireshark on the HAProxy machine to capture SSL handshakes and parse them (you won't need to decrypt the sessions for handshake analysis, although you could since you have the server private key).
I had this happen as well. The following appeared first SSL handshake failure then after switching off option dontlognull we also got Timeout during SSL handshake in the haproxy logs.
At first, I made sure all the defaults timeouts were correct.
timeout connect 30s
timeout client 30s
timeout server 60s
Unfortunately, the issue was in the frontend section
There was a line with timeout client 60 which I only assume means 60ms instead of 60s.
It seems certain clients were slow to connect and were getting kicked out during the SSL handshake. Check your frontend for client timeouts.
How is your haproxy ssl frontend configured ?
For example I use the following to mitigate BEAST attacks :
bind X.X.X.X:443 ssl crt /etc/haproxy/ssl/XXXX.pem no-sslv3 ciphers RC4-SHA:AES128-SHA:AES256-SHA
But some clients seem to generate the same "SSL handshake failure" errors. I think it's because the configuration is too restrictive.