Event-driven TLS server - ssl

I'm working on server-side software that receives requests from clients via TLS (over TCP). For better performance and user experience, I'd like to avoid a full handshake for every request. Ideally, a client can just keep a TLS session with the server for hours, even though the session might be idle most of the time. At the same time, high throughput is also required.
One easy way to do it is to dedicate a thread to each session and use a big thread pool to boost throughput. But the performance overhead of this approach could be huge if I want, say, tens of thousands of concurrent sessions.
The requirement for high throughput leads me to the event-driven model. The idea is that when a connection is idle (namely, no I/O event on the underlying socket), the TLS server can switch context to serve other connections. One of the challenges is to somehow freeze the entire TLS session context while the socket is idle and retrieve it when the socket becomes readable/writable.
I'm wondering if there is already support in TLS for this kind of feature? Both the session cache and session tickets seem relevant. Also, I'm wondering whether people have implemented this idea.
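To make the idea concrete, here's a rough sketch of the model I have in mind (just my own illustration using Python's selectors and ssl modules; the certificate paths and port are placeholders):

```python
import selectors
import socket
import ssl

sel = selectors.DefaultSelector()
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("server.pem", "server.key")  # placeholder paths

listener = socket.create_server(("0.0.0.0", 8443))
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, data="accept")

def accept():
    conn, _ = listener.accept()
    conn.setblocking(False)
    # The wrapped socket keeps all per-connection TLS state between events.
    tls = ctx.wrap_socket(conn, server_side=True, do_handshake_on_connect=False)
    sel.register(tls, selectors.EVENT_READ, data="serve")

def serve(tls):
    try:
        tls.do_handshake()               # no-op once the handshake has completed
        data = tls.recv(4096)
        if data:
            tls.sendall(b"ack: " + data)  # toy response
        else:
            sel.unregister(tls)
            tls.close()
    except (ssl.SSLWantReadError, ssl.SSLWantWriteError):
        # Socket not ready; the selector will fire again. A full implementation
        # would also watch for writability when SSLWantWriteError is raised.
        pass

while True:
    for key, _ in sel.select():
        accept() if key.data == "accept" else serve(key.fileobj)
```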

You are talking about SSL session resumption, and it is already implemented in both OpenSSL and JSSE, and no doubt in every other SSL API you would be using. SSL sessions already survive connections, so there is nothing for you to do to get this.
The part about 'freezing the SSL session context' is completely pointless.
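As a rough illustration (a sketch only, using Python's standard ssl module; the host name and port are placeholders), resumption from the client side is just a matter of handing back the session object from the previous connection -- the library and the server do the rest:

```python
import socket
import ssl

ctx = ssl.create_default_context()

def connect(session=None):
    raw = socket.create_connection(("example.com", 443))
    return ctx.wrap_socket(raw, server_hostname="example.com", session=session)

first = connect()
saved = first.session            # opaque session data held by the client
first.close()

second = connect(session=saved)  # offers the saved session for resumption
print(second.session_reused)     # True if the server accepted the resumption
second.close()
```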

Related

Extra TCP connections on the RabbitMQ server after resource alarm

I have RabbitMQ Server 3.6.0 installed on Windows (I know it's time to upgrade, I've already done that on the other server node).
Heartbeats are enabled on both server and client side (heartbeat interval 60s).
I had a resource alarm (RAM limit), and after that I observed a rise in the number of TCP connections to the RMQ server.
At the moment there are 18,000 connections, while the normal amount is 6,000.
Via the management plugin I can see there are a lot of connections with 0 channels, while our "normal" connections have at least 1 channel.
Even restarting the RMQ server doesn't help: all the connections re-establish themselves.
   1. Does that mean all of them are really alive?
A similar issue was described here https://github.com/rabbitmq/rabbitmq-server/issues/384, but as far as I can see it was fixed in v3.6.0.
   2. Do I understand correctly that before RMQ Server v3.6.0, the behavior after a resource alarm was that several TCP connections could hang on the server side per one real client auto-recovery connection?
Maybe important: we have HAProxy between the server and the clients.
   3. Could HAProxy be an explanation for these extra connections? Maybe it prevents clients from receiving a signal that the connection was closed due to the resource alarm?
Are all of them alive?
Only you can answer this, but I would ask - how is it that you are ending up with many thousands of connections? Really, you should only create one connection per logical process. So if you really have 6,000 logical processes connecting to the server, that might be a reason for that many connections, but in my opinion, you're well beyond reasonable design limits even in that case.
To check, see by how much the connection count decreases when you kill one of your logical processes.
Do I understand correctly that before RMQ Server v3.6.0, the behavior after a resource alarm was that several TCP connections could hang on the server side per one real client auto-recovery connection?
As far as I can tell, yes. It looks like the developer in this case ran across a common problem with sockets: detecting dropped connections. If I had a dollar for every time someone misunderstood how TCP works, I'd have more money than Bezos. What they found is that someone had made some bad assumptions, when in fact a read or write is required to detect a dead socket, and the developer wrote code to (attempt to) handle it properly. It is important to note that this does not look like a very comprehensive fix, so if the same conceptual design problem has been introduced in another part of the code, this bug might still be around in some form. Searching the bug reports might give you a more detailed answer, or you could ask someone on that support list.
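To illustrate that point (my own sketch, not the actual RabbitMQ fix): the only ways to notice that a TCP peer has gone away are to read or write on the socket, or to let kernel keepalives probe it; an idle socket tells you nothing by itself.

```python
import socket

def peer_closed(sock: socket.socket) -> bool:
    """Detects an orderly close (FIN); a silently dropped peer still looks alive."""
    sock.setblocking(False)
    try:
        return sock.recv(1, socket.MSG_PEEK) == b""
    except BlockingIOError:
        return False              # no data pending, but the connection may be fine
    except OSError:
        return True               # e.g. connection reset by peer
    finally:
        sock.setblocking(True)

def enable_keepalive(sock: socket.socket) -> None:
    # Lets the kernel probe an otherwise idle connection and surface dead peers
    # as errors on the next read/write.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
```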
Could HAProxy be an explanation for these extra connections?
That depends. In theory, HAProxy is just a pass-through. For a connection to be recognized by the broker, it has to go through a handshake, which is a deliberate process and cannot happen inadvertently. Closing a connection also requires a handshake, which is where HAProxy might be the culprit. If HAProxy thinks the connection is dead and drops it without that process, then it could be a contributing cause. But it is not in and of itself making these new connections.
The RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
I recommended that this user upgrade from Erlang 18, which has known TCP connection issues -
https://groups.google.com/d/msg/rabbitmq-users/R3700QdIVJs/taDYKI6bAgAJ
I've managed to reproduce the problem: in the end it was a bug in the way our client used RMQ connections.
It created one auto-recovery connection (that part is fine) and sometimes it created a separate plain connection for "temporary" purposes.
Steps to reproduce my problem were:
   1. Reach the memory alarm in RabbitMQ (e.g. set up an easily reached RAM limit and push a lot of big messages). Connections go into the "blocking" state.
   2. Start sending a message from our client over this new "temp" connection.
   3. Ensure the connection is in the "blocked" state.
   4. Without clearing the resource alarm, restart the RabbitMQ node.
The "temp" connection was still there, despite the fact that auto-recovery was not enabled for it. And it kept sending heartbeats, so the server didn't close it.
We will fix the client to always use one and only one connection.
Plus, of course, we will upgrade Erlang.
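A minimal sketch of that single-connection approach, assuming a pika-based Python client (the host, queue name and heartbeat value are placeholders): occasional publishes get a short-lived channel, never a connection of their own.

```python
import pika

params = pika.ConnectionParameters(host="rabbit.example.com", heartbeat=60)
connection = pika.BlockingConnection(params)   # the one and only connection

def publish_temporary(message: bytes) -> None:
    # A channel per task is cheap; a connection per task is what leaked here.
    channel = connection.channel()
    try:
        channel.basic_publish(exchange="", routing_key="work", body=message)
    finally:
        channel.close()

publish_temporary(b"hello")
connection.close()
```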

Avoid concurrent SSL/TLS negotiation of first request and subsequent resource

I configured an instance on MS Azure and installed an SSL certificate for the domain. When I tested performance (on 3G), I noticed that the SSL negotiation happens on two requests concurrently, which (unnecessarily) prolongs the process and results in a 0.6s time to complete.
Does anyone have a tip how to solve/avoid this? It's adding an extra 0.3 seconds to the rendering path and I feel it can be avoided.
I would look closely at 'SSL/TLS sessions' - this is exactly what you need to decrease the negotiation time. Do not confuse them with HTTP sessions.
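If it helps, here is a small sketch (certificate paths are placeholders) of one way to confirm on the server side that sessions are actually being reused, using the per-context counters Python's ssl module exposes:

```python
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("cert.pem", "key.pem")   # placeholder paths

# ... serve some connections with this context, then:
stats = ctx.session_stats()
print("session cache hits:", stats["hits"], "cache misses:", stats["misses"])
```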

Why is mod_proxy_protocol or ELB causing high apache worker count?

We have a legacy cluster of servers running Apache 2.4 that run our application sitting behind an ELB. This ELB has two listeners, one HTTP, and one HTTPS which terminates at the ELB and sends regular HTTP traffic to the instances behind it. This ELB also has pre-open turned off (it was causing a busy worker buildup). Under normal load we have 1-3 busy workers per instance.
We have a new cluster of servers we are trying to migrate to behind a new ELB. The purpose of this migration is to allow for SNI – serving TLS traffic to thousands of domains. As such this cluster uses mod_proxy_protocol which has been enabled at the ELB level. For the purposes of testing we’ve been weighting traffic at the DNS (Route 53) level to send 30% of our traffic to the new load balancer. Under even this small load we see 5 – 10 busy workers and that grows as traffic does.
As a further test we took one of these new instances, disabled proxy_protocol, and moved it from the new ELB to the old ELB; the worker count dropped to average levels of 1-3 busy workers. This seems to indicate that there is an issue either with the ELB (differences between HTTP and TCP handling?) or with mod_proxy_protocol.
My question: why do we have twice as many busy Apache workers when using the proxy protocol and the new ELB? I would think that since TCP listeners are dumb and don’t do any processing on the traffic, they would be faster and as a result consume less worker time than HTTP listeners, which actively ‘modify’ the traffic going through them.
Any guidance to help us diagnose this issue is appreciated.
The difference is simple and significant:
An ELB in HTTP mode takes care of holding the idle keep-alive connections from browsers without holding open corresponding connections to the instance. There's no necessary correlation between browser connections and back-end connections -- a backend connection can be reused.
In TCP mode, it's 1:1. It has to be, because the ELB can't reuse a back-end connection for a different browser connection on the front end -- it's not interpreting what's going down the pipe. That's always true for TCP, but if the reason isn't intuitive, it should be particularly obvious with the proxy protocol enabled. The PROXY "header" is not in fact a "header" in the usual sense -- it's a preamble. It can only be sent at the very beginning of a connection, identifying the source address and port. The connection persists until the browser or server closes it, or it times out. It's 1:1.
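To make that concrete, here is a simplified sketch of what a PROXY protocol v1 preamble carries and how a backend might parse it (the addresses are examples; this is illustrative, not HAProxy's or the ELB's implementation):

```python
def parse_proxy_v1(data: bytes) -> dict:
    # A v1 preamble is a single ASCII line such as:
    #   PROXY TCP4 192.0.2.10 203.0.113.5 51234 443\r\n
    # It is sent once, at the very start of the connection, before any payload.
    line, _, rest = data.partition(b"\r\n")
    parts = line.decode("ascii").split(" ")
    if parts[0] != "PROXY" or len(parts) != 6:
        raise ValueError("not a PROXY protocol v1 preamble")
    proto, src_ip, dst_ip, src_port, dst_port = parts[1:]
    return {
        "protocol": proto,                                # e.g. TCP4 or TCP6
        "client": (src_ip, int(src_port)),                # the real client address
        "original_destination": (dst_ip, int(dst_port)),
        "payload_starts_with": rest[:16],
    }

example = b"PROXY TCP4 192.0.2.10 203.0.113.5 51234 443\r\nGET / HTTP/1.1\r\n"
print(parse_proxy_v1(example))
```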
This is not likely to be viable with Apache.
Back to HTTP mode, for a minute.
This ELB also has pre-open turned off (it was causing a busy worker buildup).
I don't know how you did that -- I've never seen it documented, so I assume this must have been through a support request.
This seems like a case of solving entirely the wrong problem. Instead of having a number of connections that seems to you to be artificially high, all you've really accomplished is keeping the number of connections artificially low -- ultimately, you're probably actually impairing your performance and ability to scale. Those spare connections are for the purpose of handling bursts of demand. If your instance is too small to handle them, then I would suggest that the real problem is just that: your instance is too small.
Another approach -- which is exactly the solution I use for my dreaded legacy Apache-based applications (one of which has a single Apache server sitting behind a total of about 15 to 20 different ELBs -- necessary because each ELB is offloading SSL using a certificate provided by one of the old platform's customers) -- is HAProxy between the ELBs and Apache. HAProxy can handle literally hundreds of connections and millions of requests per day on tiny instances (I'm talking tiny -- t2.nano and t2.micro), and it has no problem keeping the connections alive from all of the ELBs yet closing the Apache connection after each request... so it's optimizing things in both directions.
And of course, you can also use HAProxy with a TCP balancer and the proxy protocol -- the author of HAProxy was also the creator of the proxy protocol standard. You can also just run it on the instances with Apache rather than on separate instances. It's lightweight in memory and CPU and doesn't fork. I'm not affiliated with the project, other than having submitted occasional bug reports during the development of the Lua integration.

SSL and Load Balancing

What effect does SSL have on the way load balancing works? I know that you need to use sticky sessions if you have chosen not to store your session info in the DB or out of process, but how does that affect SSL?
Just to clarify, SSL/TLS sessions have nothing to do with HTTP sessions. (Some implementations may use the SSL/TLS session ID as a basis for maintaining HTTP sessions, but this is a bad design, as SSL/TLS may change sessions completely independently of what HTTP is doing.)
In terms of load balancing, you get a couple of options:
Use a load-balancer that is your SSL/TLS endpoint. In this case, the load-balancing will be done at the HTTP level: the client connects to the load-balancer and the load-balancer unwraps the SSL/TLS connection to pass on the HTTP content (then in clear) to its workers.
Use a load-balancer at the TCP/IP level, which redirects the entire TCP connection directly to a worker node. In this case, each worker node would have to have the certificate and private key (which isn't necessarily a problem if they're administered consistently). Using this technique, the load balancer doesn't do any HTTP processing at all (since it doesn't look inside the SSL/TLS connection): on the one hand this reduces the processing done by the load-balancer itself; on the other hand it prevents you from dispatching to a particular worker node based on, for example, the URL structure. Both methods have their advantages and disadvantages.
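For illustration of the second option (a simplified sketch of my own; the worker addresses are placeholders), a TCP-level balancer is essentially a byte pump that never looks inside the TLS stream, which is exactly why the certificate and key have to live on the workers:

```python
import itertools
import socket
import threading

WORKERS = [("10.0.0.11", 443), ("10.0.0.12", 443)]   # hypothetical worker nodes
pick_worker = itertools.cycle(WORKERS)

def pump(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes blindly in one direction; nothing here decrypts or parses HTTP.
    try:
        while (chunk := src.recv(65536)):
            dst.sendall(chunk)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def handle(client: socket.socket) -> None:
    # Round-robin dispatch: the whole TCP connection goes to one worker.
    worker = socket.create_connection(next(pick_worker))
    threading.Thread(target=pump, args=(client, worker), daemon=True).start()
    threading.Thread(target=pump, args=(worker, client), daemon=True).start()

listener = socket.create_server(("0.0.0.0", 443))
while True:
    conn, _ = listener.accept()
    handle(conn)
```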
