F5 SSL session cache reuse with a Spring Boot application

We are running a Spring Boot application with embedded Tomcat. On its own, the application works fine.
When we introduced an F5 load balancer to load-balance the traffic, our throughput went down.
When we connect to the server directly, we can see during the handshake that the server reuses the cached session.
When the calls are routed through the F5, a new session is always created.
What could be the cause of this? It was working well and broke suddenly, so we suspect our F5 configuration is broken. Any pointers for further investigation would be helpful.
%% Initialized: [Session-22, SSL_NULL_WITH_NULL_NULL]
Standard ciphersuite chosen: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
%% Negotiating: [Session-22, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384]
*** ServerHello, TLSv1.2
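A quick way to compare the two paths is openssl s_client with the -reconnect flag, which performs one full handshake and then reconnects five times with the same session, reporting "New" or "Reused" for each attempt (hostnames and ports below are placeholders for your setup):

    # Direct to the Tomcat instance
    openssl s_client -connect app-host:8443 -reconnect

    # Through the F5 virtual server
    openssl s_client -connect f5-vip:443 -reconnect

If the direct path reports "Reused" but the F5 path reports "New" every time, the resumption state is being lost on the BIG-IP side.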

I suspect your SSL session cache and/or timeout profile settings are still at their defaults, or are set too low for your application.
Check your session cache sizes and timeouts and raise them to acceptable levels (see the tmsh sketch below).
K6767: Overview of the BIG-IP SSL session cache profile settings
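As a sketch, you can inspect and raise those values on the client SSL profile with tmsh; the profile name and numbers here are illustrative, not recommendations:

    # Show the current session cache settings for the profile
    tmsh list ltm profile client-ssl my_clientssl_profile cache-size cache-timeout

    # Cache up to 262144 sessions, each reusable for one hour
    tmsh modify ltm profile client-ssl my_clientssl_profile cache-size 262144 cache-timeout 3600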
You may also need to check other timeout settings to verify the entire TCP session isn't being reset due to idle timeouts. Without knowing how your application sets up and tears down connections, it's difficult to say.
K7606: Overview of BIG-IP idle session time-outs
I suspect the first KB I linked will solve your issue, but it's always safe to check all timeout settings across the profiles used by that virtual server. Hope this points you in the right direction.

From K6767 mentioned earlier, there's a link to :
K13854: The BIG-IP system incorrectly allocates the SSL session cache size based on per virtual
This explains why the SSL session cache size may need to be reduced when the same profile is attached to several virtual servers.

Related

Apache threads stay in state reading after queries

My configuration is Apache and Tomcat behind an AWS ELB. Apache is configured with no keepalive, and MaxClients is set to a low number because each query is very CPU-intensive. When I load-test the machine, the number of available workers goes to zero, as can be seen by curl -s localhost/server-status?auto not responding immediately. When I stop the load test, the scoreboard from curl -s localhost/server-status?auto is still full of 'R's, even though the Tomcat logs make clear that nothing is happening. Does anyone have an idea what the possible causes might be?
If your Apache status shows 'R', it means there are open TCP connections from the ELB to Apache (just an open TCP connection; no data sent yet).
There is no complete official documentation on this subject (how the number of pre-opened connections is optimized), but the Amazon documentation (at https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html) states that:
Classic Load Balancers use pre-open connections but Application Load Balancers do not.
So, the answer is: it is an optimization by Amazon (opening a TCP connection has a small but real cost).
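If you want to confirm this on an instance, the pre-opened connections show up as established sockets with no request in flight; assuming Apache listens on port 80, something like this will count them:

    # Established connections terminating at local port 80
    ss -tn state established '( sport = :80 )' | wc -l

Watch that count after stopping the load test; if it stays high even though Tomcat is idle, those are the ELB's pre-opened connections.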

Apache mod_wsgi slowloris DoS protection

Assuming the following setup:
Apache server 2.4
mpm_prefork with default settings (256 workers?)
Default Timeout (300s)
High KeepAliveTimeout (100s)
reqtimeout_mod enabled with the following config: RequestReadTimeout header=62,MinRate=500 body=62,MinRate=500
Outdated mod_wsgi 3.5 using Daemon mode with 15 threads and 1 process
AWS ElasticBeanstalk's load balancer acting as a reverse proxy to apache with 60s idle connection timeout
Python/Django as the WSGI application
A simple slowloris attack like the one described here, using a "slow" request body: https://www.blackmoreops.com/2015/06/07/attack-website-using-slowhttptest-in-kali-linux/
The above attack, with just 15 requests (the same as the number of mod_wsgi threads), can easily lock the server until a timeout happens, either because:
the load balancer timeout (60s) fires due to no data being sent, which kills the Apache connection and lets mod_wsgi once again serve requests, or
Apache's RequestReadTimeout fires because data is being sent, but not quickly enough; again, mod_wsgi can serve requests after this.
Either way, with just 15 concurrent "slow" requests I was able to lock the server for up to 60 seconds.
Repeating the same with a larger number, like 4096 requests, pretty much locks the server permanently, since there will always be a new request waiting to be served by mod_wsgi once a previous one times out.
I would expect the load balancer to handle/detect this before even sending requests to Apache, which it already does for similar attacks (partial headers and TCP SYN flood attacks never hit Apache, which is nice).
What options are available to mitigate this? I know there's no foolproof option, since these kinds of attacks are difficult to detect and block, but it's quite silly that the server can be locked up so easily.
Also, if the WSGI application never reads the request body, I would expect the issue not to occur, since the request should return immediately, but I'm not sure about this or the internals of mod_wsgi. For example, this holds when using a local dev WSGI server (the attack fails since the request body is never read), yet the attack succeeds against mod_wsgi, which leads me to think it tries to read the body even before handing the request to the WSGI code.
Slowloris is a very simple denial-of-service attack, and it is easy to detect and block.
Detecting and preventing DoS and DDoS attacks is a complex topic with many solutions. In your case you are making the situation worse by using outdated software and picking a low worker-thread count, so the problem arises quickly.
A combination of services is typically used to manage DoS and DDoS attacks.
The front end of the overall system should be protected by a firewall, typically one that includes a Web Application Firewall that understands the nuances of the HTTP protocol. In the AWS world, AWS WAF and AWS Shield are commonly used.
Another service that helps is a CDN. Amazon CloudFront integrates with AWS Shield, so it has good DDoS support.
The next step is to combine load balancers with auto-scaling mechanisms. When the health checks start to fail (caused by Slowloris), the auto scaler will begin launching new instances and terminating failed ones. However, a sustained Slowloris attack will just hit the new servers, which is why the Web Application Firewall needs to detect the attack and start blocking it.
For your studies, take a look at mod_reqtimeout. It is an effective and tunable defense for Apache against most Slowloris attacks.
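As a sketch, a tighter configuration than the one in the question might look like this (the limits are illustrative and would need tuning against your slowest legitimate clients):

    <IfModule reqtimeout_module>
        # Allow 20s for the request headers, extended by 1s per 500 bytes
        # received, up to a hard cap of 40s; same idea for the body, capped at 60s
        RequestReadTimeout header=20-40,MinRate=500 body=20-60,MinRate=500
    </IfModule>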
[Update]
In the Amazon DDoS White Paper June 2015, Slowloris is specifically mentioned.
On AWS, you can use Amazon CloudFront and AWS WAF to defend your application against these attacks. Amazon CloudFront allows you to cache static content and serve it from AWS Edge Locations that can help reduce the load on your origin. Additionally, Amazon CloudFront can automatically close connections from slow-reading or slow-writing attackers (e.g., Slowloris).
Amazon DDoS White Paper June 2015
In mod_wsgi daemon mode there are a number of options that help combat such attacks by recovering from them and discarding queued requests that have been waiting too long. Try your tests using mod_wsgi-express, as it defines defaults for a lot of these options, whereas when using mod_wsgi directly there are no defaults. Use mod_wsgi-express start-server --help to see what the defaults are. The options you want to look at for mod_wsgi daemon mode are request-timeout, connect-timeout, socket-timeout and queue-timeout. There are also other options related to buffer sizes and the listener backlog you can play with. Do note that ultimately the listen backlog of the main Apache worker processes can still be an issue, because it usually defaults to 500, which means a lot of requests can queue up before you can even tag them with a time and so discard the backlog by tracking queue time.
You can find the documentation at:
http://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
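As a sketch, the daemon directive with those options might look like this; the numbers are illustrative, and these options require a modern mod_wsgi rather than the outdated 3.5 in the question:

    # Hypothetical daemon group; tune the timeouts to your traffic
    WSGIDaemonProcess myapp processes=2 threads=15 \
        request-timeout=60 connect-timeout=10 \
        socket-timeout=60 queue-timeout=30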
On the point of whether mod_wsgi reads the request body before handing the request to the application: no, it doesn't. Apache itself, because it reads in blocks, may pick up part of the request body while reading the headers, but it shouldn't block on it. Once the full request headers have been passed off to mod_wsgi and sent through to the daemon process, mod_wsgi will start transferring the request body.
Solution:
If you are getting hit, I recommend moving behind a provider that protects against DDoS attacks. Your best bet, though, is to programmatically block an IP once it has been identified as malicious. If you receive two large Content-Length POST requests, block the IP for a few minutes for suspicious activity. Many of the large providers are very cheap, and some, such as Cloudflare, are free for the basic package. I use them for my company and I am beyond happy to have them!
Edit: Their job is literally just to protect you. That is it.

Event-driven TLS server

I'm working on server-side software that receives requests from clients via TLS (over TCP). For better performance and user experience, I'd like to avoid a full handshake for every request. Ideally, a client could just keep a TLS session with the server for hours, even though most of the time the session would be idle. At the same time, high throughput is also required.
One easy way to do this is to dedicate a thread to each session and use a big thread pool to boost throughput. But the overhead of this method could be huge if I want, say, tens of thousands of concurrent sessions.
The requirement of high throughput leads me to the event-driven model. The idea is that when a connection is idle (no I/O event on the underlying socket), the TLS server can switch context to serve other connections. One of the challenges is to somehow freeze the entire TLS session context while the socket is idle and retrieve it when the socket becomes readable/writable.
I'm wondering if there is already support in TLS for this kind of feature? Both the session cache and session tickets seem relevant. I'm also wondering whether people have implemented this idea.
You are talking about SSL session resumption, and it is already implemented in both OpenSSL and JSSE, and no doubt in every other SSL API you might use. SSL sessions already survive connections, so there is nothing for you to do to get this.
The part about 'freezing the entire TLS session context' is unnecessary: in an event-driven server you simply keep each connection's TLS object alive between I/O events, and the library holds all the handshake state for you.
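For example, with JSSE the default client-side session cache makes two successive connections from the same JVM to the same host and port share a session automatically; a minimal check looks like this (the hostname is a placeholder, and note that under TLS 1.3 the session-ID comparison is no longer meaningful because resumption uses tickets):

    import javax.net.ssl.SSLSocket;
    import javax.net.ssl.SSLSocketFactory;
    import java.util.Arrays;

    public class ResumptionCheck {
        public static void main(String[] args) throws Exception {
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            byte[] firstId;
            // First connection: full handshake, session lands in the default client cache
            try (SSLSocket s = (SSLSocket) factory.createSocket("example.com", 443)) {
                s.startHandshake();
                firstId = s.getSession().getId();
            }
            // Second connection: JSSE offers the cached session for resumption
            try (SSLSocket s = (SSLSocket) factory.createSocket("example.com", 443)) {
                s.startHandshake();
                System.out.println("Resumed: " + Arrays.equals(firstId, s.getSession().getId()));
            }
        }
    }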

Why is mod_proxy_protocol or ELB causing high apache worker count?

We have a legacy cluster of servers running Apache 2.4 that run our application, sitting behind an ELB. This ELB has two listeners, one HTTP and one HTTPS, which terminates at the ELB and sends regular HTTP traffic to the instances behind it. This ELB also has pre-open turned off (it was causing a busy-worker buildup). Under normal load we have 1-3 busy workers per instance.
We have a new cluster of servers we are trying to migrate to, behind a new ELB. The purpose of this migration is to allow for SNI, serving TLS traffic to thousands of domains. This cluster therefore uses mod_proxy_protocol, with the proxy protocol enabled at the ELB level. For testing purposes we've been weighting traffic at the DNS (Route 53) level to send 30% of our traffic to the new load balancer. Even under this small load we see 5-10 busy workers, and that grows as traffic does.
As a further test, we took one of these new instances, disabled proxy_protocol, and moved it from the new ELB to the old ELB; the worker count dropped back to average levels of 1-3 busy workers. This seems to indicate that there is an issue either with the ELB (differences between HTTP and TCP handling?) or with mod_proxy_protocol.
My question: why do we have twice the busy Apache workers when using the proxy protocol and the new ELB? I would think that since TCP listeners are dumb and don't do any processing on the traffic, they would be faster and as a result consume less worker time than HTTP listeners, which actively 'modify' the traffic going through them.
Any guidance to help us diagnose this issue is appreciated.
The difference is simple and significant:
An ELB in HTTP mode takes care of holding the idle keep-alive connections from browsers without holding open corresponding connections to the instance. There's no necessary correlation between browser connections and back-end connections -- a backend connection can be reused.
In TCP mode, it's 1:1. It has to be, because the ELB can't reuse a back-end connection for a different browser connection on the front end; it's not interpreting what's going down the pipe. That's always true for TCP, but if the reason isn't intuitive, it should be particularly obvious with the proxy protocol enabled. The PROXY "header" is not in fact a header in the usual sense; it's a preamble. It can only be sent at the very beginning of a connection, identifying the source address and port. The connection persists until the browser or server closes it, or it times out. It's 1:1.
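For reference, that preamble is a single text line prepended to the stream, for example:

    PROXY TCP4 198.51.100.22 203.0.113.7 54321 443\r\n

giving the original client address/port and the destination address/port; everything after it is the unmodified byte stream.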
That one-to-one model is not likely to be viable with Apache.
Back to HTTP mode, for a minute.
This ELB also has pre-open turned off (it was causing a busy worker buildup).
I don't know how you did that -- I've never seen it documented, so I assume this must have been through a support request.
This seems like a case of solving entirely the wrong problem. Instead of having a number of connections that seemed artificially high to you, all you've really accomplished is keeping the number of connections artificially low; ultimately, you're probably impairing your performance and ability to scale. Those spare connections exist to handle bursts of demand. If your instance is too small to handle them, then I would suggest the real problem is just that: your instance is too small.
Another approach, which is exactly the solution I use for my dreaded legacy Apache-based applications (one of which has a single Apache server sitting behind a total of about 15 to 20 different ELBs; necessary because each ELB offloads SSL using a certificate provided by one of the old platform's customers), is HAProxy between the ELBs and Apache. HAProxy can handle hundreds of connections and millions of requests per day on tiny instances (I'm talking tiny: t2.nano and t2.micro), and it has no problem keeping the connections from all of the ELBs alive while closing the Apache connection after each request, so it optimizes things in both directions.
And of course, you can also use HAProxy with a TCP balancer and the proxy protocol; the author of HAProxy was also the creator of the proxy protocol standard. You can also run it on the instances with Apache rather than on separate instances. It's lightweight in memory and CPU and doesn't fork. I'm not affiliated with the project, other than having submitted occasional bug reports during the development of the Lua integration.
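A minimal sketch of that layout, assuming the ELB is a TCP listener with the proxy protocol enabled, HAProxy listens on 8080, and Apache is local on port 80 (all names and ports illustrative):

    defaults
        mode http
        timeout connect 5s
        timeout client  60s
        timeout server  60s

    frontend fe_elb
        bind :8080 accept-proxy      # parse the PROXY preamble sent by the ELB
        option forwardfor            # hand the real client IP to Apache as X-Forwarded-For
        default_backend be_apache

    backend be_apache
        option http-server-close     # keep the ELB side alive, close the Apache side per request
        server apache 127.0.0.1:80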

Glassfish failover without load balancer

I have a GlassFish v2u2 cluster with two instances and I want to fail over between them. Every document I have read on this subject says that I should put a load balancer in front of GlassFish, such as Apache httpd. In this scenario failover works, but I again have a single point of failure.
Is Glassfish able to do that fail-over without a load balancer in front?
The way we solved this is to have two IP addresses that both respond to the URL. The DNS provider (DNS Made Easy) will round-robin between the two. Setting the TTL low ensures that if one server fails, the other will answer. When one server stops responding, DNS Made Easy will only return the other host for this URL. You will have to trust the DNS provider, but you can buy service with extremely high availability of the DNS lookup.
As for high availability, you can have a cluster setup that allows for session replication, so that the user won't lose more than, potentially, the one request that fails.
Hmm... JBoss can do failover without a load balancer, according to the docs (http://docs.jboss.org/jbossas/jboss4guide/r4/html/cluster.chapt.html), Chapter 16.1.2.1, "Client-side interceptor".
As far as I know, the GlassFish cluster provides in-memory session replication between nodes. If I use Sun's GlassFish Enterprise Application Server I can use HADB, which promises 99.999% availability.
No, you can't do it at the application level.
Your options are:
Round-robin DNS: expose both your servers to the internet and let the client do the load balancing. This is quite attractive, as it will definitely enable failover.
Use a different layer-3 load-balancing system, such as "Windows Network Load Balancing", "Linux Network Load Balancing", or the one I wrote called "Fluffy Linux cluster".
Use a separate load balancer that has a failover hot spare.
In any of these cases you still need to ensure that your database, session data, etc. are available and in sync between the members of your cluster, which in practice is much harder.