My IIS 7.5 web server farm (two physical Windows Server 2008 R2 servers using Network Load Balancing) is under heavy load (500+ GET requests/sec with over 20K current connections), and SSL/TLS requests to port 443 are timing out in what appears to be the TLS negotiation.
Despite the heavy load, the server hardware is performing fine: less than 20% processor utilization, 75% of memory still available, and virtually no processor queuing. Bandwidth utilization is fine as well. However, during this heavy usage event, my websites stopped responding to SSL-based (https) requests and clients were unable to negotiate a TLS connection. During this same time, requests using http to the same websites were working fine and the websites were very responsive (I disabled the IIS rewrite rule from http to https). The problem may have gone away after I uninstalled my CA-issued certificate, reinstalled the same one, and restarted all web services. However, I can't say for sure that this corrected it, because I also stopped forcing the use of SSL.
In troubleshooting, the only thing I see is that my Windows event logs are filled with Event ID: 36887, which seems to be related to SSL but the meaning of the error is vague to me. This is the description of the error message:
"This error message indicates the computer received an SSL fatal alert message from the server ( It is not a bug in the Schannel or the application that uses Schannel). Sometimes is caused by the installation of third party web browser (other than Internet Explorer)."
There are hundreds of entries per minute corresponding to the time of the performance issues. After this occurred, I was told to enable the CAPI2 log but since the issue is not occurring now, I only see informational messages in this log.
What would cause TLS to be unable to negotiate a connection under heavy load in my network load balanced web farm, and how can I prevent this from occurring again?
I have the following setup:
Service Fabric cluster running 5 machines, with several services running in Docker containers
A public IP which has port 443 open, forwarding to the service running Traefik
Traefik terminates the SSL, and proxies the request over to the service being requested over HTTP
Here's the behavior I get:
The first request to https:// is very, very slow. Chrome will usually load it eventually, after a few timeouts or "no content" errors. Invoke-WebRequest in PowerShell usually just times out with a "The underlying connection was closed" message.
However, once it loads, I can refresh things or run the command again and it responds very, very quickly. It'll work as long as there's regular traffic to the URL.
If I leave for a bit (not sure on the time, definitely a few minutes) it dies and goes back to the beginning.
My Question:
What would cause SSL handshakes to just break or take forever? What component in this stack is to blame? Is something in Service Fabric timing out? Is it a Traefik thing? I could switch over to Nginx if it's more stable. We use these same certs on IIS, and we don't have this problem.
I could use something like New Relic to constantly send a ping every minute to keep things alive, but I'd rather figure out why the connection is dying after a few minutes.
What are the best ways to go about debugging this? I don't see anything in the Traefik log files (in DEBUG mode); in fact, when it doesn't connect, there's no record of the request in the access logs at all. Are there any tools that could help debug this? Thanks!
Is the Traefik service healthy on all 5 nodes, and can you inspect the logs of all 5 instances? If it is not healthy everywhere, the Azure Load Balancer may be balancing across nodes where Traefik is not listening, which would cause intermittent and slow responses. Once a healthy Traefik responds, you'll get a sticky session cookie, which will then make subsequent responses faster. You can enable ApplicationInsights monitoring for Traefik logs to save you crawling across all the machines: https://github.com/jjcollinge/traefik-on-service-fabric#debugging. I'd also recommend testing this without SSL first, to ensure Traefik can route correctly over HTTP, and then adding HTTPS. That way you'll know whether the problem is in the SSL configuration (e.g. whether the certificates are mounted correctly, the Traefik toml config, trusted certificates, etc.)
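To check the nodes one at a time rather than through the load balancer, something like the following rough Python sketch could time the TCP connect and the TLS handshake separately on a fresh connection to each node. It assumes you run it from somewhere that can reach the nodes directly (for example, inside the virtual network). The function names, the node addresses, and the 2-second threshold are all made up for illustration; they are not part of Traefik or Service Fabric.

```python
import socket
import ssl
import time


def time_tls_handshake(host, port=443, server_name=None, timeout=10.0):
    """Return (tcp_seconds, tls_seconds) for one fresh connection.

    Raises OSError (which covers timeouts and ssl.SSLError) on failure.
    """
    context = ssl.create_default_context()
    start = time.perf_counter()
    raw = socket.create_connection((host, port), timeout=timeout)
    tcp_done = time.perf_counter()
    try:
        # wrap_socket runs the TLS handshake on the connected socket.
        # server_name is sent as SNI and used for certificate checking.
        tls = context.wrap_socket(raw, server_hostname=server_name or host)
    except Exception:
        raw.close()
        raise
    tls_done = time.perf_counter()
    tls.close()
    return tcp_done - start, tls_done - tcp_done


def probe_nodes(nodes, server_name, port=443, timeout=10.0):
    """Probe each node once; map node -> timings tuple or error text."""
    results = {}
    for node in nodes:
        try:
            results[node] = time_tls_handshake(node, port, server_name, timeout)
        except OSError as exc:
            results[node] = "%s: %s" % (type(exc).__name__, exc)
    return results


def flag_suspects(results, slow_after=2.0):
    """Return nodes that failed or whose TLS handshake was slow."""
    suspects = []
    for node, outcome in results.items():
        if isinstance(outcome, str) or outcome[1] > slow_after:
            suspects.append(node)
    return sorted(suspects)
```

Usage would look like flag_suspects(probe_nodes(["10.0.0.4", "10.0.0.5"], "your.site.name")), with your own node addresses and site name substituted. A node that shows up there while the others answer quickly would be the first place to read the Traefik logs.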
We are attempting to allow a client to access one of our QA environments. They are seeing the following error in IE:
This page can't be displayed
Turn on TLS 1.0, TLS 1.1, and TLS 1.2 in Advanced settings and try connecting to https://oursite.com again. If this error persists, it is possible that this site uses an unsupported protocol or cipher suite such as RC4 (link for the details), which is not considered secure. Please contact your site administrator.
I am not asking Stack Overflow users to solve this problem.
I am asking the following very specific question:
Because we are seeing this error, does this prove that connectivity exists, i.e. our firewall is letting them through? I am thinking that if they were blocked at the firewall they would simply get a timeout or perhaps a 403 or 500 error. Since they are getting far enough to see which TLS protocols are supported on the web server, I infer that they must be able to communicate with it on OSI levels 1-4. Am I correct? (I need to know whether to engage the networking team, which runs the firewalls, or the application support team, which sets up the TLS configuration.)
Note that SSL terminates on our IIS web server (we don't have SSL offloading).
Unfortunately we have port 80 blocked so we can only test on 443; otherwise I would suggest using http access to help isolate the problem.
... if they were blocked at the firewall they would simply get a timeout or perhaps a 403 or 500 error.
In order to send back a 403 or 500 error, the firewall must have successfully completed the SSL handshake with the client, because the HTTP response (which includes the status code, i.e. 403, 500..) is only sent inside the encrypted connection. There is no way to return a 403 or 500 during the SSL handshake itself.
Typical behavior with a firewall in between would be a timeout (the firewall drops packets) or, more likely, a connection reset or close (the firewall resets or closes the connection). A simple packet filter firewall will usually block the TCP connection itself, resulting in connection refused. But a firewall using DPI might let the TCP connection establish and only block once it sees actual data, based on the content of the payload (i.e. application detection).
The last case might result in the error you see. But exactly the same behavior can be seen if there is a problem on the server side where the server simply closes or resets the connection. Some TLS stacks behave this way (instead of sending back a TLS alert) when they cannot find a shared protocol version or cipher. So from this error message you can conclude neither that the firewall is blocking the connection nor that the server is causing the error.
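To see which of these cases you are actually hitting, a small probe run from the client's side of the firewall can separate "TCP never connected" from "TCP connected, then the handshake failed". The following is only a sketch in Python: the category names and the probe and classify_failure functions are my own invention for illustration. Note also that a recent Python/OpenSSL build may itself refuse old protocol versions or ciphers, so a TLS-level failure here does not by itself tell you which side is at fault.

```python
import socket
import ssl


def classify_failure(exc, tcp_connected):
    """Map an exception from a connection attempt to a rough category."""
    timed_out = isinstance(exc, socket.timeout)
    if not tcp_connected:
        if isinstance(exc, ConnectionRefusedError):
            return "tcp_refused"    # closed port or rejecting packet filter
        if timed_out:
            return "tcp_timeout"    # packets are probably being dropped
        return "tcp_error"
    # From here on the TCP connection was established.
    if isinstance(exc, (ssl.SSLEOFError, ConnectionResetError,
                        ConnectionAbortedError)):
        # Ambiguous: a DPI firewall or the server itself closed or reset.
        return "closed_during_handshake"
    if timed_out:
        return "handshake_timeout"
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "certificate_rejected"   # handshake reached the cert check
    if isinstance(exc, ssl.SSLError):
        return "tls_error"          # some TLS endpoint answered, e.g. alert
    return "other_error"


def probe(host, port=443, timeout=10.0):
    """Try TCP connect, then TLS handshake; return a category string."""
    tcp_connected = False
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            tcp_connected = True
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return "handshake_ok " + str(tls.version())
    except OSError as exc:
        return classify_failure(exc, tcp_connected)
```

Calling probe("oursite.com") and getting tcp_refused or tcp_timeout points at the network path. Getting tls_error or certificate_rejected means something speaking TLS answered. closed_during_handshake stays ambiguous for the reasons given above.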
I have a client application that connects to a remote server via https for commercial purposes. This connection uses old IO (blocking connection). It normally runs smoothly.
Recently I cloned the client, creating a second client instance that runs on the same box and uses the same client certificate. Since then I have been seeing many connection timeouts from the server. I wonder whether the cloning may somehow be the cause of the timeouts and whether there is an SSL issue here.
Both instances receive the following system parameters for security:
javax.net.ssl.trustStore=cacerts
javax.net.ssl.keyStore=1234567890123
javax.net.ssl.keyStorePassword=wordpass
Unfortunately the support from the server side is quite limited. I hope someone in this forum may come up with an idea.
Our application (which uses existing Erlang OTP R15B01 modules) sends https requests to an external authentication server, gets a reply, and seems to work fine under normal conditions. But under heavy load some requests fail because the SSL handshake takes too long.
I have observed the following things during SSL handshake:
the client (our application) takes nearly 80 sec to send its certificate after the server hello is done with the server certificate
our server expects the request-response to complete within 30 sec and otherwise drops the connection, so this results in connection failures and severely affects the performance of the application
Finally, I would like to know:
Is our application failing to load the client certificate quickly? I mean, does the httpc module do file I/O to load the certificates, which would make the response slow under heavy load?
Does Erlang have any limitations in SSL handshake procedure?
On a customer's internal network, I can make a request to my SSL site using IE6 SP1 (on Win2K) and see one cert validation request, but if I use IE6 SP2 (on XP), 13 separate cert validation requests get fired off. Needless to say, this slows down my page load a lot.
Firefox loads the page just fine with no unnecessary cert validation requests.
The server is Apache running a pretty new lampp stack. All the server certificate / CA chain configurations seem to be fine (users can authenticate w/ trusted certs, the system can communicate to other systems with that server cert, etc.)
Is there anything I can do from a configuration standpoint? Any other ideas at all?
I'm guessing that "upgrade IE" is off the table, right? You're probably trying to find a way to support IE 6.0, SP2, with XP, so your users can use this version.
OK... one thought is trying to mess with the SSL configuration. As I remember, SSL has a number of settings that can be used and perhaps you can change one of them on your server and get a different behavior. It might be worth it to research what's happening during the SSL Handshake on the working and not-working versions of IE 6.0. I favor Ethereal, a free network traffic watching tool that will capture the SSL. It can't decrypt it easily, but you can at least see the first few messages that happen in the clear. It might give an inkling into why all these validation requests are coming in.