Nomad High Availability with Traefik

I've decided to give Nomad a try, and I'm setting up a small environment for side projects in my company.
Although the documentation on Nomad/Consul is nice and detailed, it doesn't cover the simple task of exposing a small web service to the world.
I'm following this official tutorial to use Traefik as a load balancer, but how can I make those exposed services reachable?
The tutorial has a footnote stating that the services can be accessed from outside the cluster on port 8080.
But in a cluster where I have 3 servers and 3 clients, where should I point my DNS to?
Should a DNS with failover pointing to the 3 clients be enough?
Do I still need a load balancer for the clients?

There are multiple ways you could handle distributing the requests across your servers. Some may be preferable to others depending on your deployment environment.
The Fabio load balancer docs have a section on deployment configurations which I'll use as a reference.
Direct with DNS failover
In this model, you could configure DNS to point to the IPs of all three servers. Clients would receive all three IPs back in response to a DNS query, and randomly connect to one of the available instances.
If an IP is unhealthy, the client should retry the request to one of the other IPs, but clients may experience slower response times if a server is unavailable for an extended period of time and the client is occasionally routing requests to that unavailable IP.
You can mitigate this issue by configuring your DNS server to perform health checking of backend instances (assuming it supports it). AWS Route 53 provides this functionality (see Configuring DNS failover). If your DNS server does not support health checking, but provides an API to update records, you can use Consul Terraform Sync to automate adding/removing server IPs as the health of the Fabio instances changes in Consul.
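For example, a hedged AWS CLI sketch of creating such a health check (the IP, port, and path are placeholders for whatever health endpoint your Fabio instances expose), which you would then associate with failover or weighted record sets:
# Hypothetical values; point this at the health endpoint of one Fabio instance
aws route53 create-health-check \
  --caller-reference fabio-node-a-1 \
  --health-check-config Type=HTTP,IPAddress=198.51.100.10,Port=9998,ResourcePath=/health,RequestInterval=30,FailureThreshold=3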
Fabio behind a load balancer
As you mentioned the other option would be to place Fabio behind a load balancer. If you're deploying in the cloud, this could be the cloud provider's LB. The LB would give you better control over traffic routing to Fabio, provide TLS/SSL termination, and other functionality.
If you're on-premises, you could front it with any available load balancer like F5, A10, nginx, Apache Traffic Server, etc. You would need to ensure the LB is deployed in a highly available manner. Some suggestions for doing this are covered in the next section.
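As a rough illustration (the IPs and Fabio's default proxy port 9999 are assumptions about your setup), an nginx front end could look like this:
# Minimal nginx sketch: spread traffic across the three Fabio instances
upstream fabio {
    server 192.0.2.11:9999 max_fails=3 fail_timeout=10s;
    server 192.0.2.12:9999 max_fails=3 fail_timeout=10s;
    server 192.0.2.13:9999 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;
    location / {
        proxy_pass http://fabio;
        proxy_set_header Host $host;
    }
}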
Direct with IP failover
Whether you're running Fabio directly on the Internet, or behind a load balancer, you need to make sure the IP which clients are connecting to is highly available.
If you're deploying on-premises, one method for achieving this would be to assign a common loopback IP to each of the Fabio servers (e.g., 192.0.2.10), and then use an L2 redundancy protocol like Virtual Router Redundancy Protocol (VRRP) or an L3 routing protocol like BGP to ensure the network routes requests to available instances.
L2 failover
Keepalived is a VRRP daemon for Linux. You can find many tutorials online for installing and configuring it.
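For illustration, a minimal keepalived.conf sketch, assuming a shared VIP of 192.0.2.10, an eth0 interface, and a simple process check for Fabio (all of these are placeholders for your environment):
vrrp_script chk_fabio {
    script "/usr/bin/pgrep fabio"   # hypothetical health check: is Fabio running?
    interval 2
}

vrrp_instance fabio_vip {
    state BACKUP                    # let priority elect the MASTER
    interface eth0                  # adjust to your NIC
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24               # the shared VIP clients connect to
    }
    track_script {
        chk_fabio
    }
}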
L3 failover w/ BGP
GoCast is a BGP daemon built on GoBGP which conditionally advertises IPs to the upstream network based on the state of health checks. The author of this tool published a blog post titled BGP based Anycast as a Service which walks through deploying GoCast on Nomad, and configuring it to use Consul for health information.
L3 failover with static IPs
If you're deploying on-premises, a simpler configuration than the two aforementioned solutions might be to configure your router to install/remove static routes based on health checks to your backend instances. Cisco routers support this through their IP SLA feature. This tutorial walks through a basic configuration: http://www.firewall.cx/cisco-technical-knowledgebase/cisco-routers/813-cisco-router-ipsla-basic.html.
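A hedged IOS sketch of that pattern (addresses, prefixes, and SLA numbers are placeholders):
! Probe one Fabio instance every 5 seconds
ip sla 1
 icmp-echo 192.0.2.11
 frequency 5
ip sla schedule 1 life forever start-time now
track 1 ip sla 1 reachability
! Primary route to the service prefix, withdrawn automatically when the probe fails
ip route 198.51.100.0 255.255.255.0 192.0.2.11 10 track 1
! Backup route via the second instance with a worse administrative distance
ip route 198.51.100.0 255.255.255.0 192.0.2.12 20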
As you can see, there are many ways to configure HA for Fabio or an upstream LB. It's hard to provide a good recommendation without knowing more about your environment. Hopefully one of these suggestions will be useful to you.

In the following example, a network of Nomad nodes is in the 192.168.8.140-250 range, and a floating IP, 192.168.8.100, is the one that is DNATed from the firewall for ports 80/443.
Traefik is coupled with keepalived in the same group. keepalived will assign the floating IP to the node where Traefik is running, and there will be only one keepalived instance in MASTER state.
It's not keepalived's usual use case, but it's good at broadcasting ARP when it comes alive.
job "traefik" {
datacenters = ["dc1"]
type = "service"
group "traefik" {
constraint {
operator = "distinct_hosts"
value = "true"
}
volume "traefik_data_le" {
type = "csi"
source = "traefik_data"
read_only = false
attachment_mode = "file-system"
access_mode = "multi-node-multi-writer"
}
network {
port "http" {
static = 80
}
port "https" {
static = 443
}
port "admin" {
static = 8080
}
}
service {
name = "traefik-http"
provider = "nomad"
port = "http"
}
service {
name = "traefik-https"
provider = "nomad"
port = "https"
}
task "keepalived" {
driver = "docker"
env {
KEEPALIVED_VIRTUAL_IPS = "192.168.8.100/24"
KEEPALIVED_UNICAST_PEERS = ""
KEEPALIVED_STATE = "MASTER"
KEEPALIVED_VIRTUAL_ROUTES = ""
}
config {
image = "visibilityspots/keepalived"
network_mode = "host"
privileged = true
cap_add = ["NET_ADMIN", "NET_BROADCAST", "NET_RAW"]
}
}
task "server" {
driver = "docker"
config {
image = "traefik:2.9.6"
network_mode = "host"
ports = ["admin", "http", "https"]
args = [
"--api.dashboard=true",
"--entrypoints.web.address=:${NOMAD_PORT_http}",
"--entrypoints.websecure.address=:${NOMAD_PORT_https}",
"--entrypoints.traefik.address=:${NOMAD_PORT_admin}",
"--certificatesresolvers.letsencryptresolver.acme.email=email#email",
"--certificatesresolvers.letsencryptresolver.acme.storage=/letsencrypt/acme.json",
"--certificatesresolvers.letsencryptresolver.acme.httpchallenge.entrypoint=web",
"--entrypoints.web.http.redirections.entryPoint.to=websecure",
"--entrypoints.web.http.redirections.entryPoint.scheme=https",
"--providers.nomad=true",
"--providers.nomad.endpoint.address=http://192.168.8.140:4646" ### IP to your nomad server
]
}
volume_mount {
volume = "traefik_data_le"
destination = "/letsencrypt/"
}
}
}
}
For keepalived to run, you need to allow privileged containers and the required capabilities in the Docker plugin configuration:
plugin "docker" {
config {
allow_privileged = true
allow_caps = [...,"NET_ADMIN","NET_BROADCAST","NET_RAW"]
}
}

Related

Varnish 6.0lts won't handle secure websockets on a remote proxy?

I'm having a hard time with this setup. I have a node.js box serving HTTP on 3000, websockets on 3001, and secure websockets on 3002. Out in front of that I have a remote Hitch/Varnish caching proxy on its own server that's listening on 443/80 and connecting to the first server as its default backend via 3000. A user who visits the site URL https://foo.tld hits the Varnish proxy and sees the site, where some javascript on the site tells their browser to connect to wss://foo.tld:3002 for secure websockets.
My problem is getting websockets to pass transparently through to the backend. In the VCL I have the standard
if (req.http.upgrade ~ "(?i)websocket") {
    return (pipe);
}
and
sub vcl_pipe {
    # Declare pipe handler for websockets
    if (req.http.upgrade) {
        set bereq.http.upgrade = req.http.upgrade;
        set bereq.http.connection = req.http.connection;
    }
}
Which doesn't work in this case. To list what I have tried so far with no success:
1: Creating a second backend in VCL named "websockets" that is the same backend IP but on either port 3001 or 3002 and adding "set req.backend_hint = websockets;" before the pipe summon in the first snippet above.
2: Turning off HTTPS and trying to connect it over pure HTTP.
3: Modifying varnish.service to try to make Varnish listen on ports other than, or in addition to, -a :80 and -a :8443,proxy, in which case Varnish simply refuses to start. One attempt was to use HTTP only and run Varnish on 3001 to get ws:// working without SSL, but Varnish refuses to start.
4: Most recently I attempted the following in VCL to try and pick up client connections coming in on 3001:
if (std.port(server.ip) == 3001) {
    set req.backend_hint = websockets;
}
My goal is for the Varnish box to pick up secure websocket traffic (wss://) on 3002 (via hitch at 443 using the normal secure websocket connection protocol) and have that passed transparently to the backend websocket server, whether SSL encrypted across that leg of the connection or not. I have set up other, smaller servers like this before and getting websockets working is trivial if Varnish and the backend service are either on the same machine or behind a regulating CDN like Cloudflare, so it has been extra frustrating trying to figure out just what this remote proxy setup needs. I feel like part of the solution is having Varnish or Hitch (not sure) listening on 3002 to accept the connections at which point the normal req.http.upgrade and pipe functions would come into play, but the software refuses to cooperate.
Update
I have broken down the problem into the simplest form I can. The main server (backend) is now serving plain HTTP on 8080 and WS:// on 6081. I have removed hitch and TLS from the equation entirely, but even in this simplified form it is impossible to connect to websockets through Varnish. I can verify that the Websocket server is working correctly on the backend. Connecting to the backend IP address with a browser shows websockets functioning perfectly there. It's Varnish that's the problem.
My current hitch.conf (not relevant here but provided per request):
frontend = "[*]:443"
frontend = "[*]:3001"
backend = "[127.0.0.1]:8443" # 6086 is the default Varnish PROXY port.
workers = 4 # number of CPU cores
daemon = on
# We strongly recommend you create a separate non-privileged hitch
# user and group
user = "redacted"
group = "redacted"
# Enable to let clients negotiate HTTP/2 with ALPN. (default off)
# alpn-protos = "h2, http/1.1"
# run Varnish as backend over PROXY; varnishd -a :80 -a localhost:6086,PROXY ..
write-proxy-v2 = on # Write PROXY header
syslog = on
log-level = 1
# Add pem files to this directory
# pem-dir = "/etc/pki/tls/private"
pem-file = "/redacted/hitch-bundle.pem"
Current default.vcl (stripped down to almost nothing just for testing this. The backend is NOT running on the same machine, it is remote):
# Marker to tell the VCL compiler that this VCL has been adapted to the
# new 4.0 format.
vcl 4.0;
# Default backend definition. Set this to point to your content server.
backend default {
    .host = "remote.server.ip";
    .port = "8080";
}

backend websockets {
    .host = "remote.server.ip";
    .port = "6081";
}

sub vcl_recv {
    # Happens before we check if we have this in cache already.
    #
    # Typically you clean up the request here, removing cookies you don't need,
    # rewriting the request, etc.

    # Allow websockets to pass through the cache (summons pipe handler below)
    if (req.http.Upgrade ~ "(?i)websocket") {
        set req.backend_hint = websockets;
        return (pipe);
    } else {
        set req.backend_hint = default;
    }
}

sub vcl_pipe {
    if (req.http.upgrade) {
        set bereq.http.upgrade = req.http.upgrade;
        set bereq.http.connection = req.http.connection;
    }
    return (pipe);
}
Varnish's systemd exec parameters:
ExecStart=/usr/sbin/varnishd \
-a http=:80 \
-a proxy=localhost:8443,PROXY \
-a ws=:6081 \
-p feature=+http2 \
-f /etc/varnish/default.vcl \
-s malloc,256m \
-p pipe_timeout=1800
Working in plain HTTP and insecure websockets like this, it should be very simple to get a working model. I don't understand what could possibly be going wrong.
Varnish Cache, the open source version of Varnish, doesn't support backend connections over TLS.
While you can offload TLS using Hitch, the connection to your websocket server will not be encrypted.
Basic VCL example
Here's a very basic VCL example where web & websocket requests are split and sent to separate backends:
vcl 4.1;

backend web {
    .port = "3000";
}

backend ws {
    .port = "3001";
}

sub vcl_recv {
    if (req.http.Upgrade ~ "(?i)websocket") {
        set req.backend_hint = ws;
        return (pipe);
    } else {
        set req.backend_hint = web;
    }
}

sub vcl_pipe {
    if (req.http.upgrade) {
        set bereq.http.upgrade = req.http.upgrade;
    }
    return (pipe);
}
Need more input
However, I'm probably missing a lot of context. I also didn't specify a .host parameter in the backends, so the assumption is that all services are hosted locally.
Please add your full VCL, your Hitch config and the varnishd runtime parameters to your question. This will add context and allows me to come up with a better solution.
What about Hitch?
If you terminate TLS in Hitch, both HTTPS & secure websockets will be handled by Hitch, while the plain-text HTTP & websockets will still be handled directly by Varnish.
See https://www.varnish-software.com/developers/tutorials/terminate-tls-varnish-hitch for a Hitch tutorial that also explains how Varnish should be configured.
I'm a big advocate of using the PROXY protocol in Varnish. The hitch tutorial has a specific section about this: https://www.varnish-software.com/developers/tutorials/terminate-tls-varnish-hitch/#enable-the-proxy-protocol-in-varnish
Custom ports
The standard ports to access the service are 80 for HTTP and insecure websockets and 443 for HTTPS and secure websockets.
If you want to use custom ports for the websockets, it is possible to configure them in Hitch and Varnish; a rough sketch follows after the lists below.
Let's say you want to use ports 3001 and 3002 for your websockets. This means you need 2 frontends in Hitch:
One for HTTPS on 443
One for secure WS on 3002
See https://www.varnish-software.com/developers/tutorials/terminate-tls-varnish-hitch/#listening-address for more information about the frontend config.
Varnish on the other hand needs to have 3 listening addresses:
One for HTTP on port 80 (-a http=:80)
One for offloaded HTTPS & secure WS with PROXY support on port 8443 (-a proxy=:8443,PROXY)
One for insecure WS on port 3001 (-a ws=:3001)
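Putting that together, a rough sketch of the Hitch frontends and varnishd listening addresses (the ports follow the example above; everything else is an assumption about your setup):
# hitch.conf: one frontend for HTTPS, one for secure websockets
frontend = "[*]:443"
frontend = "[*]:3002"
backend  = "[127.0.0.1]:8443"   # Varnish PROXY listener
write-proxy-v2 = on

# varnishd: plain HTTP, PROXY-offloaded TLS traffic, and insecure websockets
varnishd -a http=:80 -a proxy=localhost:8443,PROXY -a ws=:3001 -f /etc/varnish/default.vcl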
Next steps
Please use this information and see if it helps you find a solution. If not, please share your VCL file, your Hitch config and your varnishd runtime parameters.
Update
Now that you provided more input, the picture starts to become more clear. The fact that you eliminated the TLS part for now will make it a lot easier to debug.
Assuming the names of your listening interfaces for varnishd are http and ws (as mentioned in your systemd unit file), we can use the following varnishlog commands to debug:
varnishlog -g request -q "ReqStart[3] eq 'http'"
This command will show logs for all log transactions where the http listening interface is used.
If you want to make it more granular, you can also add the request URL as a filtering criterion. This will narrow down the number of transactions:
varnishlog -g request -q "ReqStart[3] eq 'http' and ReqUrl eq '/'"
Please add a complete log transaction for one of the failed requests. This will help us understand why requests are failing.
You can do the same for requests on the ws listening interface by using the commands below:
varnishlog -g request -q "ReqStart[3] eq 'ws'"
varnishlog -g request -q "ReqStart[3] eq 'ws' and ReqUrl eq '/'"
I'm assuming you're successful at starting the varnishd program but unsuccessful at getting decent output out of Varnish. The varnishlog program will provide the insight we need. Please add the logging output to your question so I can look into it.

Terraform: Security group to connect an ASG to an ALB

My settings
I am building a pretty standard stack of an Auto Scale group of EC2 instances, receiving traffic via an ALB.
The instances expose port 80 to the entire VPC, and the ALB exposes port 443 with a certificate externally to receive traffic from the Internet.
My problem
I would like to enable port 80 access from the ALB only, not from the entire VPC.
My question
How can I define a security group in Terraform, that exposes port 80 of the instances to the ALB only, but not to other parts of the VPC?
Maybe this is helpful:
resource "aws_security_group_rule" "opened_to_alb" {
type = "ingress"
from_port = 80
to_port = 80
protocol = "tcp"
source_security_group_id = "${var.alb_sg_id}"
security_group_id = "${aws_security_group.your_sg.id}"
}
var.alb_sg_id can be replaced with your actual alb security group id.
You can create a security group (aws_security_group) for your load balancer and give that access to port 80 using an aws_security_group_rule.
Then, in the security group of the servers, allow access on port 80 only from the load balancer's security group, like so: source_security_group_id = "${aws_security_group.mylb.id}"
Does this make sense? If not, I can elaborate further.
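To illustrate the wiring, a minimal sketch (the resource names and the VPC variable are assumptions, not your actual configuration):
resource "aws_security_group" "mylb" {
  name   = "alb-sg"
  vpc_id = "${var.vpc_id}"

  # HTTPS open to the Internet on the ALB
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "instances" {
  name   = "instance-sg"
  vpc_id = "${var.vpc_id}"

  # Port 80 reachable only from the ALB's security group, not the whole VPC
  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = ["${aws_security_group.mylb.id}"]
  }
}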

kubernetes intra cluster service communication

I have a composite service S.c which consumes 2 atomic services, S.a and S.b, where all three services are running in a Kubernetes cluster. Which would be the better pattern?
1) Create Sa and Sb as headless services and let Sc integrate with them via an external load balancer like NGINX+ (which uses a DNS resolver to maintain up-to-date backend pods)
2) Create Sa and Sb with a clusterIP and let Sc access/resolve them via cluster DNS (the skyDNS add-on), which internally leverages iptables-based load balancing to the pods
Note: My k8s cluster is running on custom solution (on-premise VMs)
We have many composite services which consume 1 to many atomic services (like example above).
Edit: In a few scenarios I would also need to expose services to the external network,
e.g. Sb would need to be accessible both from Sc and from outside.
In such cases it would make more sense to create Sb as a headless service, otherwise the DNS resolver would always return only the clusterIP address and all external requests would also get routed to the clusterIP address.
My challenge is that the two scenarios (intra vs. inter) conflict with each other.
Example: nginx-service (which has a clusterIP) and nginx-headless-service (headless)
/ # nslookup nginx-service
Server: 172.16.48.11
Address 1: 172.16.48.11 kube-dns.kube-system.svc.cluster.local
Name: nginx-service
Address 1: 172.16.50.29 nginx-service.default.svc.cluster.local
/ # nslookup nginx-headless-service
Server: 172.16.48.11
Address 1: 172.16.48.11 kube-dns.kube-system.svc.cluster.local
Name: nginx-headless-service
Address 1: 11.11.1.13 wrkfai1asby2.my-company.com
Address 2: 11.11.1.15 imptpi1asby.my-company.com
Address 3: 11.11.1.4 osei623a-sby.my-company.com
Address 4: 11.11.1.6 osei511a-sby.my-company.com
Address 5: 11.11.1.7 erpdbi02a-sbyold.my-company.com
Using DNS + cluster IPs is the simpler approach, and doesn't require exposing your services to the public internet. Unless you want specific load-balancing features from nginx, I'd recommend going with #2.
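For option #2, a hedged sketch of how Sc could reach the atomic services through cluster DNS alone (the deployment/service names sa and sb are placeholders):
# Expose Sa and Sb with ordinary ClusterIP services (the default type)
kubectl expose deployment sa --name=sa --port=80 --target-port=8080
kubectl expose deployment sb --name=sb --port=80 --target-port=8080

# From inside any Sc pod, the names resolve to stable cluster IPs and
# kube-proxy spreads the connections across the backing pods
nslookup sa.default.svc.cluster.local
curl http://sa.default.svc.cluster.local/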

Can not link a HTTP Load Balancer to a backend (502 Bad Gateway)

I have on the backend a Kubernetes node running on port 32656 (Kubernetes Service of type NodePort). If I create a firewall rule for the <node_ip>:32656 to allow traffic, I can open the backend in the browser on this address: http://<node_ip>:32656.
What I try to achieve now is creating an HTTP Load Balancer and link it to the above backend. I use the following script to create the infrastructure required:
#!/bin/bash
GROUP_NAME="gke-service-cluster-61155cae-group"
HEALTH_CHECK_NAME="test-health-check"
BACKEND_SERVICE_NAME="test-backend-service"
URL_MAP_NAME="test-url-map"
TARGET_PROXY_NAME="test-target-proxy"
GLOBAL_FORWARDING_RULE_NAME="test-global-rule"
NODE_PORT="32656"
PORT_NAME="http"
# instance group named ports
gcloud compute instance-groups set-named-ports "$GROUP_NAME" --named-ports "$PORT_NAME:$NODE_PORT"
# health check
gcloud compute http-health-checks create --format none "$HEALTH_CHECK_NAME" --check-interval "5m" --healthy-threshold "1" --timeout "5m" --unhealthy-threshold "10"
# backend service
gcloud compute backend-services create "$BACKEND_SERVICE_NAME" --http-health-check "$HEALTH_CHECK_NAME" --port-name "$PORT_NAME" --timeout "30"
gcloud compute backend-services add-backend "$BACKEND_SERVICE_NAME" --instance-group "$GROUP_NAME" --balancing-mode "UTILIZATION" --capacity-scaler "1" --max-utilization "1"
# URL map
gcloud compute url-maps create "$URL_MAP_NAME" --default-service "$BACKEND_SERVICE_NAME"
# target proxy
gcloud compute target-http-proxies create "$TARGET_PROXY_NAME" --url-map "$URL_MAP_NAME"
# global forwarding rule
gcloud compute forwarding-rules create "$GLOBAL_FORWARDING_RULE_NAME" --global --ip-protocol "TCP" --ports "80" --target-http-proxy "$TARGET_PROXY_NAME"
But I get the following response from the Load Balancer accessed through the public IP in the Frontend configuration:
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The health check is left with default values: (/ and 80) and the backend service responds quickly with a status 200.
I have also created the firewall rule to accept any source and all ports (tcp) and no target specified (i.e. all targets).
Considering that regardless of the port I choose (in the instance group), and that I get the same result (Server Error), the problem should be somewhere in the configuration of the HTTP Load Balancer. (something with the health checks maybe?)
What am I missing from completing the linking between the frontend and the backend?
I assume you actually have instances in the instance group, and the firewall rule is not specific to a source range. Can you check your logs for a google health check? (UA will have google in it).
What version of Kubernetes are you running? FYI, there's a resource in 1.2 that hooks this up for you automatically: http://kubernetes.io/docs/user-guide/ingress/, just make sure you do these: https://github.com/kubernetes/contrib/blob/master/ingress/controllers/gce/BETA_LIMITATIONS.md.
More specifically: in 1.2 you need to create a firewall rule, service of type=nodeport (both of which you already seem to have), and a health check on that service at "/" (which you don't have, this requirement is alleviated in 1.3 but 1.3 is not out yet).
Also note that you can't put the same instance into 2 loadbalanced IGs, so to use the Ingress mentioned above you will have to cleanup your existing loadbalancer (or at least, remove the instances from the IG, and free up enough quota so the Ingress controller can do its thing).
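For reference, a hedged example of the kind of firewall rule that lets Google's health checkers and the load balancer reach the NodePort (the rule name is made up; 130.211.0.0/22 and 35.191.0.0/16 are the documented health-check source ranges):
gcloud compute firewall-rules create allow-lb-health-checks \
  --source-ranges 130.211.0.0/22,35.191.0.0/16 \
  --allow tcp:32656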
There can be a few things wrong, as mentioned:
firewall rules need to be open to all hosts, or they need to have the same network tag as the machines in the instance group
by default, the node should return 200 at / - configuring readiness and liveness probes otherwise did not work for me
It seems you are trying to do things by hand that are all automated, so I can really recommend:
https://cloud.google.com/kubernetes-engine/docs/how-to/load-balance-ingress
This shows the steps that do the firewall and portforwarding for you, which also may show you what you are missing.
I noticed myself, when using an app on 8080 exposed on 80 (like one of the deployments in the example), that the load balancer stayed unhealthy until I had / returning 200 (and /healthz, which I added too). So basically that container now exposes a webserver on port 8080 returning that, and the other config wires it up to port 80.
When it comes to firewall rules, make sure they are set to all machines or make the network tag match, or they won't work.
The 502 error is usually from the loadbalancer that will not pass your request if the health check does not pass.
Could you make your service type LoadBalancer (http://kubernetes.io/docs/user-guide/services/#type-loadbalancer), which would set this all up automatically? This assumes you have the flag set for Google Cloud.
After you deploy, describe the service and it should give you the endpoint which is assigned.
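A quick sketch of that approach (the deployment name is a placeholder):
# Creates a Service of type LoadBalancer; on GCP/GKE this provisions the
# cloud load balancer and firewall wiring for you
kubectl expose deployment my-backend --type=LoadBalancer --port=80 --target-port=8080

# The assigned external endpoint shows up under "LoadBalancer Ingress"
kubectl describe service my-backend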

JMeter with remote servers

I'm trying to setup JMeter in a distributed mode.
I have a server running on an ec2 intance, and I want the master to run on my local computer.
I had to jump through some hoops to get RMI working correctly on the server, but that was solved by setting "java.rmi.server.hostname" to the IP of the ec2 instance.
The next (and hopefully last) problem is the server communicating back to the master.
The problem is that because I am doing this from an internal network, the master is sending its local/internal ip address (192.168.1.XXX) when it should be sending back the IP of my external connection (92.XXX.XXX.XXX).
I can see this in the jmeter-server.log:
ERROR - jmeter.samplers.RemoteListenerWrapper: testStarted(host) java.rmi.ConnectException: Connection refused to host: 192.168.1.50; nested exception is:
That host IP is wrong. It should be the 92.XXX.XXX.XX address. I assume this is because in the master logs I see the following:
2012/07/29 20:45:25 INFO - jmeter.JMeter: IP: 192.168.1.50 Name: XXXXXX.local FullName: 192.168.1.50
And this IP is sent to the server during RMI setup.
So I think I have two options:
Tell the master to send the external IP
Tell the server to connect on the external IP of the master.
But I can't see where to set these commands.
Any help would be useful.
For the benefit of future readers, don't take no for an answer. It is possible! Plus you can keep your firewall in place.
In this case, I did everything over port 4000.
How to connect a JMeter client and server for distributed testing with Amazon EC2 instance and local dev machine across different networks.
Setup:
JMeter 2.13 Client: local dev computer (different network)
JMeter 2.13 Server: Amazon EC2 instance
I configured distributed client / server JMeter connectivity as follows:
1. Added a port forwarding rule on my firewall/router:
Port: 4000
Destination: JMeter client private IP address on the LAN.
2. Configured the "Security Group" settings on the EC2 instance:
Type: Allow: Inbound
Port: 4000
Source: JMeter client public IP address (my dev computer/network public IP)
Update: If you already have SSH connectivity, you could use an SSH tunnel for the connection, which avoids needing to add the firewall rules.
$ ssh -i ~/.ssh/54-179-XXX-XXX.pem -o ServerAliveInterval=60 -R 4000:localhost:4000 jmeter@54.179.XXX.XXX
3. Configured client $JMETER_HOME/bin/jmeter.properties file RMI section:
note only the non-default values that I changed are included here:
#---------------------------------------------------------------------------
# Remote hosts and RMI configuration
#---------------------------------------------------------------------------
# Remote Hosts - comma delimited
# Add EC2 JMeter server public IP address:Port combo
remote_hosts=127.0.0.1,54.179.XXX.XXX:4000
# RMI port to be used by the server (must start rmiregistry with same port)
server_port=4000
# Parameter that controls the RMI port used by the RemoteSampleListenerImpl (The Controler)
# Default value is 0 which means port is randomly assigned
# You may need to open Firewall port on the Controller machine
client.rmi.localport=4000
# To change the default port (1099) used to access the server:
server.rmi.port=4000
# To use a specific port for the JMeter server engine, define
# the following property before starting the server:
server.rmi.localport=4000
4. Configured remote server $JMETER_HOME/bin/jmeter.properties file RMI section as follows:
#---------------------------------------------------------------------------
# Remote hosts and RMI configuration
#---------------------------------------------------------------------------
# RMI port to be used by the server (must start rmiregistry with same port)
server_port=4000
# Parameter that controls the RMI port used by the RemoteSampleListenerImpl (The Controler)
# Default value is 0 which means port is randomly assigned
# You may need to open Firewall port on the Controller machine
client.rmi.localport=4000
# To use a specific port for the JMeter server engine, define
# the following property before starting the server:
server.rmi.localport=4000
5. Started the JMeter server/slave with:
jmeter-server -Djava.rmi.server.hostname=54.179.XXX.XXX
where 54.179.XXX.XXX is the public IP address of the EC2 server
6. Started the JMeter client/master with:
jmeter -Djava.rmi.server.hostname=121.73.XXX.XXX
where 121.73.XXX.XXX is the public IP address of my client computer.
7. Ran a JMeter test suite.
JMeter GUI log output
Success!
I had a similar problem: the JMeter server tried to connect to the wrong address for sending the results of the test (it tried to connect to localhost).
I solved this by setting the following parameter when starting the JMeter master:
-Djava.rmi.server.hostname=xx.xx.xx.xx
It looks as though this won't work. Distributed JMeter Testing explains the requirements for load testing in a distributed environment. Numbers 2 and 3 are particular to your use case, I believe.
The firewalls on the systems are turned off.
All the clients are on the same subnet.
The server is in the same subnet, if 192.x.x.x or 10.x.x.x ip addresses are used.
Make sure JMeter can access the server.
Make sure you use the same version of JMeter on all the systems. Mixing versions may not work correctly.
Might be very late in the game, but still: I'm running this with JMeter 5.3.
So, to get it to work, set up the slaves in AWS and the controller on your local machine.
Make sure your slave has the proper local ports and hostname. The hostname on the slave should be the EC2 instance's public DNS.
Make sure AWS has the proper security policies.
For the controller (which is your local machine), make sure you run with the parameter '-Djava.rmi.server.hostname='. You can get the IP by googling "my public ip address". It is definitely not one of those 192.xxx.xxx.x or 172.xx.xxx addresses.
Then you have to configure your modem/router to port-forward to the machine used as the controller. The port can be obtained from the slave log (the one that has the FINE: RMI RenewClean... lines; yes, you have to set the log level to verbose). Or set up a DMZ and point it at your controller machine. Dangerous, but convenient just for the testing window; don't forget to turn it off afterwards.
Then it should work.
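A hedged sketch of the start-up commands for that setup (the hostnames, IP and test plan name are placeholders; JMeter 5.x also expects RMI over SSL by default, which you either configure or disable on both sides as shown):
# Slave (EC2 instance): advertise its public DNS/IP to the controller
jmeter-server -Djava.rmi.server.hostname=ec2-XX-XX-XX-XX.compute.amazonaws.com -Jserver.rmi.ssl.disable=true

# Controller (local machine): advertise your public IP and run the plan on the remote slave
jmeter -n -t testplan.jmx -R ec2-XX-XX-XX-XX.compute.amazonaws.com \
  -Djava.rmi.server.hostname=121.73.XXX.XXX -Jserver.rmi.ssl.disable=true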