HTTPS connection stops working after a few minutes - ssl

I have the following setup:
Service Fabric cluster running 5 machines, with several services running in Docker containers
A public IP which has port 443 open, forwarding to the service running Traefik
Traefik terminates the SSL, and proxies the request over to the service being requested over HTTP
Here's the behavior I get:
The first request to https:// is very, very slow. Chrome will usually eventually load it after a time timeouts or "no content" errors. Invoke-WebRequest in Powershell usually just times out with an "The underlying connection was closed" message.
However, once it loads, I can refresh things or run the command again and it responds very, very quickly. It'll work as long as there's regular traffic to the URL.
If I leave for a bit (not sure on the time, definitely a few minutes) it dies and goes back to the beginning.
My Question:
What would cause SSL handshakes to just break or take forever? What component in this stack is to blame? Is something in Service Fabric timing out? Is it a Traefik thing? I could switch over to Nginx if it's more stable. We use these same certs on IIS, and we don't have this problem.
I could use something like New Relic to constantly send a ping every minute to keep things alive, but I'd rather figure out why the connection is dying after a few minutes.
What are the best ways to go about debugging this? I don't see anything in the Traefik log files (In DEBUG mode), in fact when it doesn't connect, there's no record of the request in the access logs at all. Any tools that could help debug this? Thanks!

Is the Traefik service healthy on all 5 nodes, can you inspect the logs of all 5 instances? If not this might cause the Azure Load Balancer to load balance across nodes where Traefik is not listening which would cause intermittent and slow responses. Once a healthy Traefik responds, you'll get a sticky session cookie which will then make subsequent responses faster. You can enable ApplicationInsights monitoring for Traefik logs to save you crawling across all the machines: https://github.com/jjcollinge/traefik-on-service-fabric#debugging. I'd also recommend testing this without SSL to ensure Traefik can route correctly over HTTP first and then add HTTPS. That way you'll know it's something to do with the SSL configuration (i.e. mounted the certificates correctly, Traefik toml config, trusted certificates, etc.)

Related

Is there any way to increase the cloudflare proxy request timeout limit(524)? [duplicate]

Is it possible to increase CloudFlare's time-out? If yes, how?
My code takes a while to execute and I wasn't planning on Ajaxifying it the coming days.
No, CloudFlare only offers that kind of customisation on Enterprise plans.
CloudFlare will time out if it fails to establish a HTTP handshake after 15 seconds.
CloudFlare will also wait 100 seconds for a HTTP response from your server before you will see a 524 timeout error.
Other than this there can be timeouts on your origin web server.
It sounds like you need Inter-Process Communication. HTTP should not be used a mechanism for performing blocking tasks without sending responses, these kind of activities should instead be abstracted away to a non-HTTP service on the server. By using RabbitMQ (or any other MQ) you can then pass messages from the HTTP element of your server over to the processing service on your webserver.
I was in communication with Cloudflare about the same issue, and also with the technical support of RabbitMQ.
RabbitMQ suggested using Web Stomp which relies on Web Sockets. However Cloudflare suggested...
Websockets would create a persistent connection through Cloudflare and
there's no timeout as such, but the best way of resolving this would
be just to process the request in the background and respond asynchronously, and serve a 'Loading...' page or similar, rather than having the user to wait for 100 seconds. That would also give a better user experience to the user as well
UPDATE:
For completeness, I will also record here that
I also asked CloudFlare about running the report via a subdomain and "grey-clouding" it and they replied as follows:
I will suggest to verify on why it takes more than 100 seconds for the
reports. Disabling Cloudflare on the sub-domain, allow attackers to
know about your origin IP and attackers will be attacking directly
bypassing Cloudflare.
FURTHER UPDATE
I finally solved this problem by running the report using a thread and using AJAX to "poll" whether the report had been created. See Bypassing CloudFlare's time-out of 100 seconds
Cloudflare doesn't trigger 504 errors on timeout
504 is a timeout triggered by your server - nothing to do with Cloudflare.
524 is a timeout triggered by Cloudflare.
See: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#502504error
524 error? There is a workaround:
As #mjsa mentioned, Cloudflare only offers timeout settings to Enterprise clients, which is not an option for most people.
However, you can disable Cloudflare proxing for that specific (sub)domain by turning the orange cloud into grey:
Before:
After:
Note: it will disable extra functionalities for that specific (sub)domain, including IP masking and SSL certificates.
As Cloudflare state in their documentation:
If you regularly run HTTP requests that take over 100 seconds to
complete (for example large data exports), consider moving those
long-running processes to a subdomain that is not proxied by
Cloudflare. That subdomain would have the orange cloud icon toggled to
grey in the Cloudflare DNS Settings . Note that you cannot use a Page
Rule to circumvent Error 524.
I know that it cannot be treated like a solution but there is a 2 ways of avoiding this.
1) Since this timeout is often related to long time generating of something, this type of works can be done through crontab or if You have access to SSH you can run a PHP command directly to execute. In this case connection is not served through Cloudflare so it goes as long as your configuration allows it to run. Check it on Google how to run scripts from command line or how to determine them in crontab by using /usr/bin/php /direct/path/to/file.php
2) You can create subdomain that is not added to cloudlflare and move Your script there and run them directly through URL, Ajax call or whatever.
There is a good answer on Cloudflare community forums about this:
If you need to have scripts that run for longer than around 100 seconds without returning any data to the browser, you can’t run these through Cloudflare. There are a couple of options: Run the scripts via a grey-clouded subdomain or change the script so that it kicks off a long-running background process and quickly returns a status which the browser can poll until the background process has completed, at which point the full response can be returned. This is the way most people do this type of action as keeping HTTP connections open for a long time is unreliable and can be very taxing also.
This topic on Stackoverflow is high in SERPs so I decided to write down this answer for those who will find it usefull.
https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#502504error
Cloudflare 524 error results from a web page taking more than 100 seconds to completely respond.
This can be overridden to (up to) 600 seconds ... if you change to "Enterprise" Cloudflare account. The cost of Enterprise is roughtly $40k per year (annual contract required).
If you are getting your results with curl, you could use the resolve option to directly access your IP, not using the Cloudflare proxy IP:
For example:
curl --max-time 120 -s -k --resolve lifeboat.com:443:127.0.0.1 -L https://lifeboat.com/blog/feed
The simplest way to do this is to increase your proxy waiting timeout.
If you are using Nginx for instance you can simply add this line in your /etc/nginx/sites-availables/your_domain:
location / {
...
proxy_read_timeout 600s; # this increases it by 10mins; feel free to change as you see fit with your needs.
...
}
If the issue persists, make sure you use let's encrypt to secure your server alongside Nginx and then disable the orange cloud on that specific subdomain on Cloudflare.
Here are some resources you can check to help do that
installing-nginx-on-ubuntu-server
secure-nginx-with-let's-encrypt

Service Temporarily Unavailable under load in OpenShift Enterprise 2.0

Using OpenShift Enterprise 2.0, I have a simple jbossews (tomcat7) + mysql 5.1 app that uses JSP files connected to a mysql database. The app was created as a non-scaled app (fwiw the same issue happens when scaling is enabled).
Using a JMeter driver with only a single concurrent user and no think time, it will chug along for about 2 minutes (at about 200 req/sec) and then it will start returning "503 Service Temporarily Unavailable" in batches (a few seconds at a time) on and off for the remainder of the test. Even if I change nothing (don't restart the app) if I wait a moment and then try again, it will do the same thing--first it seems fine, but then it will start with the errors.
The gear is far from fully-utilized (memory/cpu), and the only log I can find that shows a problem is the /var/log/httpd/error_log, which fills up with these entries:
[Tue Mar 25 15:51:13 2014] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.8.162.129:8080 (*) failed
Looking at the 'top' command on the node host at the time that the errors start to occur, I see several httpd processes surge to the top on and off.
So it looks like I am somehow running out of proxy connections or something similar. However, I'm not sure how that is happening with only a single concurrent user. Any ideas of how to fix this? I couldn't find any similar posts.
The core problem is that the system is running out of ephemeral ports due to connections stuck in TIME_WAIT. Check using:
netstat -pan --tcp | less
or
netstat -pan --tcp | grep -c ".*TIME_WAIT"
to just count the number of connections in time wait state.
These are connections made by the node port proxy (httpd) to the tomcat backend. There are several ways to change TCP settings in order to lessen the problem. First attempt is to enable reuse. Append the following to /etc/sysctl.conf:
# allow reuse of time_wait connections
net.ipv4.tcp_tw_reuse=1
This will allow connections in TIME_WAIT state to be reused if there are no ephemeral ports available.
However,the problem mostly remains that these connections are not being properly pooled. I do not run into this issue outside of a gear with the same app+driver--meaning that the connections are properly pooled and don't have to sit in TIME_WAIT state at all. Something in the proxy must be interfering with the connection closure.
Looks like the mod_proxy / mod_rewrite are not configured for connection pooling/keepalive or they are not compatible with it.
You should first try and move to vhost routing, if your hitting this issue, but tcp tw reuse can help if vhost connection are still so high that you still run out of port.
https://access.redhat.com/articles/1203843 also has a lot of good information on the topic, including this on possible causes of error 503:
Understanding that HAProxy preforms health checks on Gear
(application) contexts is important because if these checks fail you
can see 502 or 503 errors when trying to access your application,
because the Proxy disables the route to the application (IE: puts the
gear in maintenance mode).
...and...
...if you are seeing 502 or 503 errors when trying to access your
application, it could be because the proxy is disabling the routes to
the application (IE: puts the gear in maintenance mode), because it is
failing health checks...

Delay issue with Websocket over SSL on Amazon's ELB

I followed the instructions from this link:
How do you get Amazon's ELB with HTTPS/SSL to work with Web Sockets? to set up ELB to work with Websocket (having ELB forward 443 to 8443 on TCP mode). Now I am seeing this issue for wss: server sends message1, client does not receive it; after few seconds, server sends message2, client receives both messages (both messages are around 30 bytes). I can reproduce the issue fairly easily. If I set up port forwarding with iptable on the server and have client connecting directly to the server (port 443), I don't have the problem Also, the issue seems to happen only to wss. ws works fine.
The server is running jetty8.
I checked EC2 forums and did not really find anything. I am wondering if anyone has seen the same issue.
Thanks
From what you describe, this pretty likely is a buffering issue with ELB. Quick research suggests that this actually is the issue.
From the ELB docs:
When you use TCP for both front-end and back-end connections, your
load balancer will forward the request to the back-end instances
without modification to the headers. This configuration will also not
insert cookies for session stickiness or the X-Forwarded-* headers.
When you use HTTP (layer 7) for both front-end and back-end
connections, your load balancer parses the headers in the request and
terminates the connection before re-sending the request to the
registered instance(s). This is the default configuration provided by
Elastic Load Balancing.
From the AWS forums:
I believe this is HTTP/HTTPS specific but not configurable but can't
say I'm sure. You may want to try to use the ELB in just plain TCP
mode on port 80 which I believe will just pass the traffic to the
client and vice versa without buffering.
Can you try to make more measurements and see how this delay depends on the message size?
Now, I am not entirely sure what you already did and what failed and what did not fail. From the docs and the forum post, however, the solution seems to be using the TCP/SSL (Layer 4) ELB type for both, front-end and back-end.
This resonates with "Nagle's algorithm" ... the TCP stack could be configured to bundling requests before sending them over the wire to reduce traffic. This would explain the symptoms, but worth a try

All jmeter requests going to only one server with haproxy

I'm using Jmeter to load test my web application. I have two web servers and we are using HAProxy for load balance. All my tests are running fine and configured correctly. I have three jmeter remote clients so I can run my tests distributed. The problem I'm facing is that ALL my jmeter requests are only being processed by one of the web servers. For some reason it's not balancing and I'm having many time outs, and huge response times. I've looked around a lot for a way to make these requests being balanced, but I'm having no luck so far. Does anyone know what can be the cause of this behavior? Please let me know if you need to know anything about my environment first and I will provide the answers.
Check your haproxy configuration:
What is it's load balancing policy, if not round-robin is it based on ip source or some other info that might be common to your 3 remote servers?
Are you sure load balancing is working right? Try testing with browser first, if you can add some information about the web server in response to debug.
Check your test plan:
Are you sure you don't have somewhere in your requests a sessionid that is hardcoded?
How many threads did you configure?
In your Jmeter script by default the HTTP Request "Use KeepAlive" header option is checked.
Keep-Alive is a header that maintains a persistent connection between
client and server, preventing a connection from breaking
intermittently. Also known as HTTP keep-alive, it can be defined as a
method to allow the same TCP connection for HTTP communication instead
of opening a new connection for each new request.
This may cause all requests to go to the same server. Just uncheck the option and save, stop your script and re-run.

Can I use Apache mod_proxy as a connection pool, under the Prefork MPM?

Summary/Quesiton:
I have Apache running with Prefork MPM, running php. I'm trying to use Apache mod_proxy to create a reverse proxy that I can re-route my requests through, so that I can use Apache to do connection pooling. Example impl:
in httpd.conf:
SSLProxyEngine On
ProxyPass /test_proxy/ https://destination.server.com/ min=1 keepalive=On ttl=120
but when I run my test, which is the following command in a loop:
curl -G 'http://localhost:80/test_proxy/testpage'
it doesn't seem to re-use the connections.
After some further reading, it sounds like I'm not getting connection pool functionality because I'm using the Prefork MPM rather than the Worker MPM. So each time I make a request to the proxy, it spins up a new process with its own connection pool (of size one), instead of using the single worker that maintains its own pool. Is that interpretation right?
Background info:
There's an external server that I make requests to, over https, for every page hit on a site that I run.
Negotiating the SSL handshake is getting costly, because I use php and it doesn't seem to support connection pooling - if I get 300 page requests to my site, they have to do 300 SSL handshakes to the external server, because the connections get closed after each script finishes running.
So I'm attempting to use a reverse proxy under Apache to function as a connection pool, to persist the connections across php processes so I don't have to do the SSL handshake as often.
Sources that gave me this idea:
http://httpd.apache.org/docs/current/mod/mod_proxy.html
http://geeksnotes.livejournal.com/21264.html
First of all, your test method cannot demonstrate connection pooling since for every call, a curl client is born and then it dies. Like dead people don't talk a lot, a dead process cannot keep a connection alive.
You have clients that bothers your proxy server.
Client ====== (A) =====> ProxyServer
Let's call this connection A. Your proxy server does nothing, it is just a show off. The handsome and hardworking server is so humble that he hides behind.
Client ====== (A) =====> ProxyServer ====== (B) =====> WebServer
Here, if I am not wrong, the secured connection is A, not B, right?
Repeating my first point, on your test, you are creating a separate client for each request. Every client needs a separate connection. Connection is something that happens between at least two parties. One side leaves and connection is lost.
Okay, let's forget curl now and look together at what we really want to do.
We want to have SSL on A and we want A side of traffic to be as fast as possible. For this aim, we have already separated side B so it will not make A even slower, right?
Connection pooling? There is no such thing as connection pooling at A. Every client comes and goes making a lot of noise. Only thing that can help you to reduce this noise is "Keep-Alive" which means, keeping connection alive from a client for some short period of time so this very same client can ask for other files that will be required by this request. When we are done, we are done.
For connections on B, connections will be pooled; but this will not bring you any performance since on one-server setup you did not have this part of the noise production.
How do we help this system run faster?
If these two servers are on the same machine, we should get rid of the show-off server and continue with our hardworking webserver. It adds a lot of unnecessary work to the system.
If these are separate machines, then you are being nice to web server by taking at least encyrption (for ssl) load from this poor guy. However, you can be even nicer.
If you want to continue on Apache, switch to mpm_worker from mpm_prefork. In case of 300+ concurrent requests, this will work much better. I really have no idea about the capacity of your hardware; but if handling 300 requests is difficult, I believe this little change will help your system a lot.
If you want to have an even more lightweight system, consider nginx as an alternative to Apache. It is very easy to setup to work with PHP and it will have a better performance.
Other than front-end side of things, also consider checking your database server. Connection pooling will make real difference here. Be sure if your PHP installation is configured to reuse connections to database.
In addition, if you are hosting static files on the same system, then move them out either on another web server or do even better by moving static files to a cloud system with CDN like AWS's S3+CloudFront or Rackspace's CloudFiles. Even without CloudFront, S3 will make you happy. Rackspace's solution comes with Akamai!
Taking out static files will make your web server "oh what happened, what is this silence? ohhh heaven!" since you mentioned this is a website and web pages have many static files for each dynamically generated html page most of the time.
I hope you can save the poor guy from the killer work.
Prefork can still pool 1 connection per backend server per process.
Prefork doesn't necessarily create a new process for each frontend request, the server processes are "pooled" themselves and the behavior depends on e.g. MinSpareServers/MaxSpareServers and friends.
To maximise how often a prefork process will have a backend connection for you, avoid very high or low maxspareservers or very high minspareservers as these will result in "fresh" processes acceptin new connections.
You can log %P in your LogFormat directive to help get an idea if how often processes are being reused.
The Problem in my case was, the the connection pooling between reverse proxy and backend server was not taking place because of the Backend Server Apache closing the SSL connection at the end of each HTTPS request.
The backend Apache Server was doing this becuse of the following Directive being present in the httpd.conf:
SetEnvIf User-Agent ".*MSIE.*" nokeepalive ssl-unclean-shutdown
This directive does not make sense when the backend server is connected via a reverse proxy and this can be removed from the backend server config.