Google Compute Engine: how to find why load balancing health checks are failing? - load-balancing

I've been trying to create a Google Compute Engine network load balancing health check for an HTTPS (port 443) endpoint. The same endpoint is reported healthy when checked over HTTP (port 80). The HTTPS endpoint also correctly returns a 200 OK response when accessed with curl, for example, which should be the condition for a healthy check.
It would be extremely helpful if there were a way to see a more detailed report of why the health check is failing, because the cause is probably something quite easy to fix, but the total lack of detail in the web interface makes this random guesswork. My attempts to find out where detailed information about failing health checks lives have come up empty.

I believe this is because load balancer health checks don't currently support HTTPS.

The 200 OK matters for the health check, but if your TCP connection is not closed properly, that alone can cause this issue. If you run tcpdump -A -n host <your_host_ip>, you can confirm whether the TCP connection is closed with a FIN/ACK. If you see the [R] flag in the output, the connection is being reset instead of closed properly.
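For example, run on the backend instance, filtering on the health-check port (443 here):
# watch the health-check probes and how each connection ends
sudo tcpdump -A -n port 443
# a clean close shows "Flags [F.]" (FIN/ACK) from both sides;
# "Flags [R]" means the connection is being reset instead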
For more information, visit this link https://developers.google.com/compute/docs/load-balancing/health-checks#steps_to_set_up_health_checks
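Separately, if you use the gcloud CLI, you can at least ask the target pool which instances it currently considers healthy; a sketch with placeholder names:
gcloud compute target-pools get-health my-target-pool --region us-central1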

Related

Is there any way to increase the Cloudflare proxy request timeout limit (524)? [duplicate]

Is it possible to increase CloudFlare's time-out? If yes, how?
My code takes a while to execute and I wasn't planning on Ajaxifying it in the coming days.
No, CloudFlare only offers that kind of customisation on Enterprise plans.
CloudFlare will time out if it fails to establish an HTTP handshake within 15 seconds.
CloudFlare will also wait 100 seconds for an HTTP response from your server before you see a 524 timeout error.
Other than this there can be timeouts on your origin web server.
It sounds like you need inter-process communication. HTTP should not be used as a mechanism for performing blocking tasks without sending responses; these kinds of activities should instead be abstracted away to a non-HTTP service on the server. By using RabbitMQ (or any other MQ) you can then pass messages from the HTTP element of your server over to the processing service on your webserver.
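A minimal sketch of that hand-off, assuming RabbitMQ's management plugin is enabled (the queue name and payload are made up):
# the HTTP handler enqueues the job instead of doing the work inline
rabbitmqadmin publish exchange=amq.default routing_key=report-jobs payload='{"report_id": 42}'
# a separate worker process consumes report-jobs and stores the result
# where the HTTP side can serve it from later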
I was in communication with Cloudflare about the same issue, and also with the technical support of RabbitMQ.
RabbitMQ suggested using Web STOMP, which relies on WebSockets. However, Cloudflare suggested...
Websockets would create a persistent connection through Cloudflare and
there's no timeout as such, but the best way of resolving this would
be just to process the request in the background and respond asynchronously, serving a 'Loading...' page or similar rather than making the user wait 100 seconds. That would also give a better user experience.
UPDATE:
For completeness, I will also record here that
I also asked CloudFlare about running the report via a subdomain and "grey-clouding" it and they replied as follows:
I would suggest verifying why the reports take more than 100 seconds.
Disabling Cloudflare on the sub-domain allows attackers to learn your
origin IP, and they can then attack it directly, bypassing Cloudflare.
FURTHER UPDATE
I finally solved this problem by running the report using a thread and using AJAX to "poll" whether the report had been created. See Bypassing CloudFlare's time-out of 100 seconds
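The shape of that fix, sketched with curl against hypothetical endpoints (the URLs and job id are made up):
# kick off the report in a background thread; the server returns a job id immediately
curl -X POST https://example.com/reports/start
# the page then polls the status endpoint until the report is ready
curl https://example.com/reports/status/42
# each individual request finishes well inside Cloudflare's 100-second limit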
Cloudflare doesn't trigger 504 errors on timeout
504 is a timeout triggered by your server - nothing to do with Cloudflare.
524 is a timeout triggered by Cloudflare.
See: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#502504error
524 error? There is a workaround:
As @mjsa mentioned, Cloudflare only offers timeout settings to Enterprise clients, which is not an option for most people.
However, you can disable Cloudflare proxying for that specific (sub)domain by turning the orange cloud into grey in the DNS settings.
Note: this disables extra functionality for that specific (sub)domain, including IP masking and SSL certificates.
As Cloudflare state in their documentation:
If you regularly run HTTP requests that take over 100 seconds to
complete (for example large data exports), consider moving those
long-running processes to a subdomain that is not proxied by
Cloudflare. That subdomain would have the orange cloud icon toggled to
grey in the Cloudflare DNS Settings. Note that you cannot use a Page
Rule to circumvent Error 524.
I know this cannot be treated as a solution, but there are two ways of avoiding the error.
1) Since this timeout is usually tied to generating something that takes a long time, that kind of work can be done through crontab, or, if you have SSH access, by running the PHP script directly from the command line; see the crontab sketch after this list. In either case the connection is not served through Cloudflare, so the job can run for as long as your configuration allows. Look up how to run scripts from the command line or how to schedule them in crontab using /usr/bin/php /direct/path/to/file.php
2) You can create a subdomain that is not added to Cloudflare, move your scripts there, and run them directly through a URL, an Ajax call, or whatever.
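For option 1, a minimal crontab entry could look like this (the schedule is an example; the path is the one from above):
# run the long report nightly at 02:00, outside any HTTP request
0 2 * * * /usr/bin/php /direct/path/to/file.php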
There is a good answer on Cloudflare community forums about this:
If you need to have scripts that run for longer than around 100 seconds without returning any data to the browser, you can’t run these through Cloudflare. There are a couple of options: Run the scripts via a grey-clouded subdomain or change the script so that it kicks off a long-running background process and quickly returns a status which the browser can poll until the background process has completed, at which point the full response can be returned. This is the way most people do this type of action as keeping HTTP connections open for a long time is unreliable and can be very taxing also.
This topic on Stack Overflow ranks high in SERPs, so I decided to write down this answer for those who will find it useful.
https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#502504error
Cloudflare 524 error results from a web page taking more than 100 seconds to completely respond.
This can be raised to (up to) 600 seconds ... if you change to an "Enterprise" Cloudflare account. The cost of Enterprise is roughly $40k per year (annual contract required).
If you are getting your results with curl, you could use the --resolve option to connect directly to your origin IP instead of the Cloudflare proxy IP:
For example:
curl --max-time 120 -s -k --resolve lifeboat.com:443:127.0.0.1 -L https://lifeboat.com/blog/feed
The simplest way to do this is to increase your proxy's read timeout.
If you are using Nginx, for instance, you can simply add this line in your /etc/nginx/sites-available/your_domain:
location / {
...
proxy_read_timeout 600s; # sets the timeout to 10 minutes; adjust as you see fit for your needs
...
}
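After editing the config, validate and reload Nginx (assuming a systemd-based setup):
sudo nginx -t && sudo systemctl reload nginx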
If the issue persists, make sure you use Let's Encrypt to secure your server alongside Nginx, and then disable the orange cloud on that specific subdomain in Cloudflare.
Here are some resources you can check to help with that:
installing-nginx-on-ubuntu-server
secure-nginx-with-let's-encrypt

HTTPS connection stops working after a few minutes

I have the following setup:
Service Fabric cluster running 5 machines, with several services running in Docker containers
A public IP which has port 443 open, forwarding to the service running Traefik
Traefik terminates the SSL, and proxies the request over to the service being requested over HTTP
Here's the behavior I get:
The first request to https:// is very, very slow. Chrome will usually load it eventually, after a few timeouts or "no content" errors. Invoke-WebRequest in PowerShell usually just times out with a "The underlying connection was closed" message.
However, once it loads, I can refresh things or run the command again and it responds very, very quickly. It'll work as long as there's regular traffic to the URL.
If I leave for a bit (not sure on the time, definitely a few minutes) it dies and goes back to the beginning.
My Question:
What would cause SSL handshakes to just break or take forever? What component in this stack is to blame? Is something in Service Fabric timing out? Is it a Traefik thing? I could switch over to Nginx if it's more stable. We use these same certs on IIS, and we don't have this problem.
I could use something like New Relic to constantly send a ping every minute to keep things alive, but I'd rather figure out why the connection is dying after a few minutes.
What are the best ways to go about debugging this? I don't see anything in the Traefik log files (in DEBUG mode); in fact, when it doesn't connect, there's no record of the request in the access logs at all. Any tools that could help debug this? Thanks!
Is the Traefik service healthy on all 5 nodes, and can you inspect the logs of all 5 instances? If not, the Azure Load Balancer might be balancing across nodes where Traefik is not listening, which would cause intermittent and slow responses. Once a healthy Traefik responds, you'll get a sticky session cookie which will then make subsequent responses faster. You can enable Application Insights monitoring for the Traefik logs to save you crawling across all the machines: https://github.com/jjcollinge/traefik-on-service-fabric#debugging. I'd also recommend testing this without SSL, to ensure Traefik can route correctly over HTTP first, and then adding HTTPS. That way you'll know whether it's something to do with the SSL configuration (i.e. whether the certificates are mounted correctly, the Traefik toml config, trusted certificates, etc.).
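As a quick way to test the plain-HTTP path on every node, something like this loop may help (node names and port are placeholders):
# print the HTTP status code Traefik returns on each node
for node in node1 node2 node3 node4 node5; do
  curl -s -o /dev/null -m 5 -w "$node: %{http_code}\n" "http://$node:80/"
done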

health check for apache knox

I want to create a health check mechanism to make sure I remove unhealthy Knox instances that are configured behind a load balancer.
A normal ping to the underlying instances will check whether the machine is reachable, but it will not determine whether the gateway on that instance is healthy and able to serve incoming requests.
I can make a request to Knox through the LB, but it will go to only one instance, and there is no way of knowing which one.
I want to know if there is any way to determine this. Or is there a mechanism provided in Knox itself through which I can make an HTTP call (non-secure, as direct HTTPS calls to the instance are not permitted) to the gateway server and find out?
Thanks!!
I am not sure which load balancer you are using. From "health check" I am assuming you are using an Elastic Load Balancer.
Create a health check with the TCP protocol. It will only check whether the port is open. If Knox is not running, those instances will go out of service and incoming requests will be redirected to the instances that are in service.
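With the AWS CLI, such a TCP health check on a Classic ELB looks roughly like this (the load balancer name, port, and thresholds are assumptions):
aws elb configure-health-check --load-balancer-name knox-lb \
  --health-check Target=TCP:8443,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2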
I don't know how your load balancer is configured, but you could try pinging knox_host:knox_port directly; this would at least tell you whether Knox is up and running (and listening).
If you want to know whether Knox is healthy (specifically, your topology), then you can try issuing a test request periodically and looking for a 200 response code.
e.g.
curl -i -u guest:guest-password -X GET \
'http://<direct-knox>:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'
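Wrapped as a simple probe that only checks for the 200, using the same credentials and topology as above:
status=$(curl -s -o /dev/null -w '%{http_code}' -u guest:guest-password \
  'http://<direct-knox>:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS')
[ "$status" = "200" ] && echo "knox healthy" || echo "knox unhealthy ($status)"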
Hope that helps!

Forward Traffic on Port through SSH Reverse Tunnel

I have an interesting scenario. I've searched everywhere, and I have bits and pieces of information; however, I don't have the full picture, and it's driving me nuts.
I also want to mention I'm nowhere near sysadmin status; however, I can get around my infrastructure enough to get the job done.
I've got 3 endpoints. I've got a device inside a network (endpoint#1) that has set up a reverse tunnel to one of my servers (endpoint#2). I've got another server (endpoint#3) that has to send requests to the device (endpoint#1) through the connection server (endpoint#2).
I'm currently able to sustain connections between endpoint#1 and endpoint#2, and to send requests from endpoint#2 to endpoint#1 without issue; however, I need endpoint#3 to be able to talk to endpoint#1 through endpoint#2.
I've tried searching for port-forwarding and reverse-tunnel scenarios; however, whatever I'm doing is not letting network traffic through.
How can I set up HTTP traffic to GET/POST from endpoint#3 to endpoint#2 and pass through to endpoint#1 via the specified reverse tunnel (on its specified port)? HELP!
Found the answer. It uses roughly the same syntax I was already using on SSH to set up the remote tunnel; the difference is binding the forwarded port to endpoint#2's internal IP address and setting GatewayPorts clientspecified in sshd_config (although I'm not 100% sure I needed this, it is an option I set).
On endpoint#1:
- ssh -R [endpoint#2-internal-ip-address]:[port]:localhost:[port-to-map-to-on-endpoint#1] user@[endpoint#2]
On endpoint#3:
- curl -X POST -d '{data}' http://[endpoint#2-internal-ip-address]:[port]/path/to/resource
This then allows the call from endpoint#3 to be passed through to endpoint#1.
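A concrete version of the whole chain, assuming (hypothetically) that endpoint#2's internal IP is 10.0.0.5 and the device on endpoint#1 serves HTTP on port 8080:
# on endpoint#2, in /etc/ssh/sshd_config (then restart sshd):
#   GatewayPorts clientspecified
# on endpoint#1: publish local port 8080 as 10.0.0.5:9090 on endpoint#2
ssh -N -R 10.0.0.5:9090:localhost:8080 tunnel-user@10.0.0.5
# on endpoint#3: the request rides the tunnel through endpoint#2 to endpoint#1
curl -X POST -d '{"key":"value"}' http://10.0.0.5:9090/path/to/resource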

Delay issue with Websocket over SSL on Amazon's ELB

I followed the instructions from this link:
How do you get Amazon's ELB with HTTPS/SSL to work with Web Sockets? to set up ELB to work with Websocket (having ELB forward 443 to 8443 in TCP mode). Now I am seeing this issue with wss: the server sends message1 and the client does not receive it; after a few seconds the server sends message2 and the client receives both messages (both are around 30 bytes). I can reproduce the issue fairly easily. If I set up port forwarding with iptables on the server and have the client connect directly to the server (port 443), I don't have the problem. Also, the issue seems to happen only with wss; ws works fine.
The server is running jetty8.
I checked EC2 forums and did not really find anything. I am wondering if anyone has seen the same issue.
Thanks
From what you describe, this is pretty likely a buffering issue with ELB. Quick research suggests that this is indeed the issue.
From the ELB docs:
When you use TCP for both front-end and back-end connections, your
load balancer will forward the request to the back-end instances
without modification to the headers. This configuration will also not
insert cookies for session stickiness or the X-Forwarded-* headers.
When you use HTTP (layer 7) for both front-end and back-end
connections, your load balancer parses the headers in the request and
terminates the connection before re-sending the request to the
registered instance(s). This is the default configuration provided by
Elastic Load Balancing.
From the AWS forums:
I believe this is HTTP/HTTPS specific but not configurable but can't
say I'm sure. You may want to try to use the ELB in just plain TCP
mode on port 80 which I believe will just pass the traffic to the
client and vice versa without buffering.
Can you try to make more measurements and see how this delay depends on the message size?
Now, I am not entirely sure what you already did, what failed, and what did not. From the docs and the forum post, however, the solution seems to be using the TCP/SSL (layer 4) ELB type for both front-end and back-end.
This resonates with Nagle's algorithm: the TCP stack can buffer small writes before sending them over the wire to reduce overhead, which would explain the symptoms. Disabling it on the server socket (TCP_NODELAY) is worth a try.