Let's Encrypt SSL with Traefik on ECS Fargate

I've been trying to solve this for days, without any luck.
Situation:
I have an ECS cluster on AWS using Fargate. The cluster contains an instance of Traefik 2.3.4 and other containers. I'm using Traefik as a reverse proxy to forward requests to the other containers.
Over HTTP everything works fine, so I decided to add HTTPS to Traefik as well. I've tried everything I could find on the Internet, but nothing works. When I try to connect to the specified domain with curl, it returns:
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number
Here are some of the tests I've done:
traefick.yml:
log:
  level: DEBUG
api:
  dashboard: true
entryPoints:
  web:
    address: :80
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
providers:
  ecs:
    clusters:
      - tools-cluster
    region: eu-west-2
    exposedByDefault: false
certificatesResolvers:
  letsencrypt:
    acme:
      caServer: https://acme-staging-v02.api.letsencrypt.org/directory
      email: #########################
      storage: acme.json
      httpchallenge:
        entrypoint: web
Labels:
"dockerLabels": {
  "traefik.enable": "true",
  "traefik.http.services.traefik.loadbalancer.server.port": "8080",
  "traefik.http.routers.traefik.rule": "Host(`${host}`)",
  "traefik.http.routers.traefik.entrypoints": "websecure",
  "traefik.http.routers.traefik.tls.certresolver": "letsencrypt",
  "traefik.http.routers.traefik.service": "api@internal"
}
This version returns the following error:
error: 400 :: urn:ietf:params:acme:error:connection :: Fetching https://traefik.baaluu.com/.well-known/acme-challenge/td8IdOvJ1_GkigY-jPYaA4YsgeiS5FUiuUS-avbpsuY: Error getting validation data, url
Let's Encrypt tries to retrieve the challenge data but cannot, because the request is redirected to HTTPS and HTTPS doesn't work. I also tried without the automatic redirect, and it returns a similar error: the validation data cannot be retrieved.
According to this guide, though, it should work correctly.
So I decided to move to the dnsChallenge with this configuration:
Traefick.yml
log:
  level: DEBUG
api:
  dashboard: true
entryPoints:
  web:
    address: :80
  websecure:
    address: ":443"
providers:
  ecs:
    clusters:
      - tools-cluster
    region: eu-west-2
    exposedByDefault: false
certificatesResolvers:
  letsencrypt:
    acme:
      caServer: https://acme-staging-v02.api.letsencrypt.org/directory
      email: ######################
      storage: acme.json
      dnsChallenge:
        provider: route53
        delayBeforeCheck: 3
and the same labels as before:
"dockerLabels": {
  "traefik.enable": "true",
  "traefik.http.services.traefik.loadbalancer.server.port": "8080",
  "traefik.http.routers.traefik.rule": "Host(`${host}`)",
  "traefik.http.routers.traefik.entrypoints": "websecure",
  "traefik.http.routers.traefik.tls.certresolver": "letsencrypt",
  "traefik.http.routers.traefik.service": "api@internal"
}
Still nothing, and I have this in the logs:
AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/170242259
Opening that URL returns:
{
  "type": "urn:ietf:params:acme:error:malformed",
  "detail": "Method not allowed",
  "status": 405
}
The latest test I did was to remove the staging CA server:
log:
  level: DEBUG
api:
  dashboard: true
entryPoints:
  web:
    address: :80
  websecure:
    address: :443
providers:
  ecs:
    clusters:
      - tools-cluster
    region: eu-west-2
    exposedByDefault: false
certificatesResolvers:
  letsencrypt:
    acme:
      email: ###############
      storage: acme.json
      dnsChallenge:
        provider: route53
        delayBeforeCheck: 2
SSL still doesn't work, but I don't see any error message in the logs. This is the last message I get about a certificate:
Try to challenge certificate for domain [traefik.baaluu.com] found in HostSNI rule" providerName=letsencrypt.acme routerName=traefik@ecs rule="Host(`traefik.baaluu.com`)"
And there is not much more after that:
(Sorry for the picture; I couldn't find a way to extract the logs from ECS.)
The other containers are still reachable over HTTP.
If I try to connect with telnet, I can reach the service:
telnet traefik.baaluu.com 443
Trying 3.8.30.164...
Connected to traefik-1547500306.eu-west-2.elb.amazonaws.com.
Escape character is '^]'.
The same goes for port 80.
Looking more closely at the logs, I also found this:
retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/chall-v3/9205340157/1Wh0tQ :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: \"0004cbkFTGjCALFGDYOmhruMl6_F_fRSj33cOMvdpx5Xd2M\", url: "
time="2020-12-10T13:08:21Z" level=debug msg="legolog: [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/chall-v3/9205340157/1Wh0tQ :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: \"0004cbkFTGjCALFGDYOmhruMl6_F_fRSj33cOMvdpx5Xd2M\", url: "
The URL in that message, https://acme-v02.api.letsencrypt.org/acme/chall-v3/9205340157/1Wh0tQ, returns:
{
  "type": "dns-01",
  "status": "valid",
  "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/9205340157/1Wh0tQ",
  "token": "44R4gD4_ZmemiCn5rtkqJyWOcjoj09sEgobUvZLH6yc",
  "validationRecord": [
    {
      "hostname": "traefik.baaluu.com"
    }
  ]
}
So I suppose the certificate has been generated correctly, but I'm not sure.
Any idea or suggestion?
Thanks in advance.
H2K
Edit:
I've removed SSL from the dashboard and put it on another container. Now, when I open the dashboard, I can see this:
So I suppose SSL is working for that domain, but I still can't connect to it.
Edit 2:
With telnet, if I connect to that URL on port 443 and request the page, I can see the content:
telnet xxxxxxxxxxxxxxxxx 443
Trying 3.10.148.201...
Connected to traefik-1547500306.eu-west-2.elb.amazonaws.com.
Escape character is '^]'.
GET /index.html HTTP/1.1
Host: xxxxxxxxxxxxxxxxx
And the content of the page appears, so it is not a load balancer or routing problem. It seems that I can reach the container on port 443; the SSL simply isn't there. It is like having two HTTP ports that both behave the same way: at the moment, port 443 acts like port 80.
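For anyone who wants to confirm this symptom programmatically, here is a minimal Python sketch (not part of the original setup). It assumes that the "wrong version number" error from curl/OpenSSL means the first bytes coming back on port 443 are not a TLS record, for example a plaintext "HTTP/1.1 ..." reply from a listener that speaks HTTP. The helper names classify_reply and port_speaks_tls are made up for illustration; only the Python standard library is used.

```python
import socket
import ssl

# TLS record content types: change_cipher_spec, alert, handshake, application_data
TLS_CONTENT_TYPES = {0x14, 0x15, 0x16, 0x17}


def classify_reply(first_bytes: bytes) -> str:
    """Classify the first bytes a server sends back after a TLS ClientHello."""
    if first_bytes.startswith(b"HTTP/"):
        # A plaintext HTTP status line: the port is not speaking TLS at all.
        return "plain-http"
    if (
        len(first_bytes) >= 2
        and first_bytes[0] in TLS_CONTENT_TYPES
        and first_bytes[1] == 0x03  # major version byte of SSL 3.0 / TLS 1.x records
    ):
        return "tls"
    return "unknown"


def port_speaks_tls(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Attempt a real TLS handshake; False means it failed at the TLS layer.

    Connection errors and timeouts are not caught here and propagate as OSError.
    """
    context = ssl.create_default_context()
    # Only the protocol is being tested, not certificate trust.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLError:
        return False
```

Under these assumptions, port_speaks_tls("traefik.baaluu.com") should return False while the 443 listener answers in plain HTTP, and True once something on that port actually terminates TLS.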

I also spent a number of days trying to work this out, so I feel your pain.
The error is misleading: the request doesn't even make it past the ALB, let alone to Traefik.
There are two factors to this issue.
The first is that when you specify port 443 through Docker Compose as "443:443", you would assume that this creates an HTTPS listener, but it actually creates a listener on port 443 using the HTTP protocol. In addition, the listener sent the data to the Fargate HTTP port and didn't redirect. I'm not sure if this is a bug, or because I hadn't specified that the protocol should be "x-aws-protocol: https" on the target port.
The second is that I found some AWS documentation saying that if you use an HTTPS port on an ALB, you need an SSL certificate in place at the ALB level. It kind of makes sense that you can't terminate the connection at the task level if you consider the swarm nature and security implications (better minds are welcome to explain).
With the above in mind, I created a certificate in ACM that covered all the domains I needed, changed the listener to the HTTPS protocol, and specified the certificate I had created. At this point I was able to configure Traefik to accept traffic to the frontend.
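As a rough illustration of that last step, the listener change can also be scripted. This is only a sketch under assumptions: it relies on boto3's elbv2 client and its modify_listener call, the function name switch_listener_to_https is made up, and every ARN is a placeholder you would replace with your own. The client is passed in as an argument so the snippet itself has no AWS dependency.

```python
def switch_listener_to_https(elbv2_client, listener_arn, certificate_arn, port=443):
    """Switch an existing ALB listener to HTTPS using an ACM certificate.

    elbv2_client is expected to be boto3.client("elbv2") (assumption); the
    ARNs are placeholders supplied by the caller.
    """
    params = {
        "ListenerArn": listener_arn,
        "Protocol": "HTTPS",
        "Port": port,
        # The ALB terminates TLS with this ACM certificate.
        "Certificates": [{"CertificateArn": certificate_arn}],
    }
    return elbv2_client.modify_listener(**params)


# Example usage (requires boto3 and AWS credentials; ARNs are placeholders):
#   import boto3
#   elbv2 = boto3.client("elbv2", region_name="eu-west-2")
#   switch_listener_to_https(elbv2, "<listener-arn>", "<acm-certificate-arn>")
```

With TLS terminated at the ALB this way, the target group can keep forwarding plain HTTP to the Traefik task.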

Related

OpenShift: Pod often does not return expected requests

I'm running a .NET Core 3.1 REST API inside a pod in OpenShift, and it processes every request smoothly: all transactions to the database (outside the OpenShift network) are executed properly, and all programmatic executions finish without errors. However, 1 in 15 requests will always fail with ECONNRESET and never return the HTTP response.
Let's say I make a GET to /users/id/3. I can see this request hitting my REST API, being processed all the way down the infrastructure layer, fetching data from the DB, wrapping the result, and sending it back to finish the request. However, at this point no response is received by the frontend or by Postman, even though I can see in the API logs that the request finished and returned.
All of these requests take 2.3 min to execute and often end in an ECONNRESET. I'm at a loss as to how to troubleshoot this. I have tried curl'ing the resource from another pod and the same behaviour appears.
I think these requests sometimes get lost in the cluster network, so I tried playing with the sessionAffinity of the service config, but as far as I understand it's not really tied to this. Do I have a wrong route config or service config?
Route config
spec:
  host: api.com.cloud
  to:
    kind: Service
    name: api
    weight: 100
  port:
    targetPort: 8080-tcp
  tls:
    termination: edge
  wildcardPolicy: None
status:
  ingress:
    - host: api.com.cloud
      routerName: default
      conditions:
        - type: Admitted
          status: 'True'
          lastTransitionTime: XXXX
      wildcardPolicy: None
      routerCanonicalHostname: router-default.apps.com.cloud
Service config
spec:
  clusterIP: XXXX
  ipFamilies:
    - IPv4
  ports:
    - name: 8080-tcp
      protocol: TCP
      port: 8080
      targetPort: 8080
  internalTrafficPolicy: Cluster
  clusterIPs:
    - XXX
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    deploymentconfig: api
status:
  loadBalancer: {}

Mercure keeps binding to port 80

I'm using the Mercure hub 0.13. Everything works fine on my development machine, but on my test server the hub keeps trying to bind to port 80, which results in an error because nginx is already running on port 80.
run: loading initial config: loading new config: http app module: start: tcp: listening on :80: listen tcp :80: bind: address already in use
I'm starting the hub with the following command:
MERCURE_PUBLISHER_JWT_KEY=$(cat publisher.key.pub) \
MERCURE_PUBLISHER_JWT_ALG=RS256 \
MERCURE_SUBSCRIBER_JWT_KEY=$(cat publisher.key.pub) \
MERCURE_SUBSCRIBER_JWT_ALG=RS256 \
./mercure run -config Caddyfile.dev
Caddyfile.dev is as follows:
# Learn how to configure the Mercure.rocks Hub on https://mercure.rocks/docs/hub/config
{
    {$GLOBAL_OPTIONS}
}
{$SERVER_NAME:localhost:3000}
log
route {
    redir / /.well-known/mercure/ui/
    encode zstd gzip
    mercure {
        # Transport to use (default to Bolt)
        transport_url {$MERCURE_TRANSPORT_URL:bolt://mercure.db}
        # Publisher JWT key
        publisher_jwt {env.MERCURE_PUBLISHER_JWT_KEY} {env.MERCURE_PUBLISHER_JWT_ALG}
        # Subscriber JWT key
        subscriber_jwt {env.MERCURE_SUBSCRIBER_JWT_KEY} {env.MERCURE_SUBSCRIBER_JWT_ALG}
        # Permissive configuration for the development environment
        cors_origins *
        publish_origins *
        demo
        anonymous
        subscriptions
        # Extra directives
        {$MERCURE_EXTRA_DIRECTIVES}
    }
    respond /healthz 200
    respond "Not Found" 404
}
When I provide SERVER_NAME as an environment variable without a domain (SERVER_NAME=:3000), the hub does start on port 3000, but it runs in HTTP mode, which only allows anonymous subscriptions and is not what I need.
Server:
Operating System: CentOS Stream 8
Kernel: Linux 4.18.0-383.el8.x86_64
Architecture: x86-64
Full output when trying to start the Mercure hub:
2022/05/10 04:50:29.605 INFO using provided configuration {"config_file": "Caddyfile.dev", "config_adapter": ""}
2022/05/10 04:50:29.606 WARN input is not formatted with 'caddy fmt' {"adapter": "caddyfile", "file": "Caddyfile.dev", "line": 3}
2022/05/10 04:50:29.609 INFO admin admin endpoint started {"address": "tcp/localhost:2019", "enforce_origin": false, "origins": ["localhost:2019", "[::1]:2019", "127.0.0.1:2019"]}
2022/05/10 04:50:29.610 INFO http enabling automatic HTTP->HTTPS redirects {"server_name": "srv0"}
2022/05/10 04:50:29.610 INFO tls.cache.maintenance started background certificate maintenance {"cache": "0xc0003d6150"}
2022/05/10 04:50:29.627 INFO tls cleaning storage unit {"description": "FileStorage:/root/.local/share/caddy"}
2022/05/10 04:50:29.628 INFO tls finished cleaning storage units
2022/05/10 04:50:29.642 INFO pki.ca.local root certificate is already trusted by system {"path": "storage:pki/authorities/local/root.crt"}
2022/05/10 04:50:29.643 INFO tls.cache.maintenance stopped background certificate maintenance {"cache": "0xc0003d6150"}
run: loading initial config: loading new config: http app module: start: tcp: listening on :80: listen tcp :80: bind: address already in use
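I'm a bit late, but I hope this will help someone.
As mentioned here, you can set the http_port global option manually in your Caddy configuration file (for example, http_port 8080 inside the global options block, where 8080 is just an illustrative free port), so that Caddy no longer tries to bind to port 80.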

Connection refused when using load balancing with Traefik 2

I have a YAML configuration file with a router and a service. Every time, I get a 404 error. I know the URL works, and I can access the server from the Traefik server. What am I missing? Also, for some reason the request gets redirected to HTTPS. Perhaps a conflicting rule?
Also note, Traefik runs in docker, but the connecting server does not. The goal here is to add multiple nodes to the load balancer.
http:
  routers:
    demo_1-rtr:
      rule: "Host(`http://demo.lab.local`)"
      service: demo_1
      entryPoints:
        - http
  services:
    demo_1:
      loadBalancer:
        servers:
          - url: "http://172.16.9.90:16000"
Traefik Config:
global:
  checkNewVersion: true
  sendAnonymousUsage: true
api:
  insecure: true
providers:
  docker:
    endpoint: "unix://var/run/docker.sock"
    exposedByDefault: false
  file:
    directory: /rules
    watch: true
log:
  level: DEBUG
accessLog: {}
entryPoints:
  http:
    address: ":80"
I suspect it is this: add the --api.insecure=true global argument and it should work.
So in your case, add the following to traefik.toml:
[api]
  insecure = true
Otherwise I would need more information to debug further.

When using "tls-alpn-01" challenge for let's encrypt certs in kubernetes using traefik, I'm getting "acme: error: 400 Timeout during connect"

I'm following the tutorial to use traefik as the ingress and ingress controller for Azure Kubernetes Service (AKS) cluster. I'm using terraform to deploy the traefik (version 1.7.24) helm chart.
resource "helm_release" "traefik" {
  name       = "traefik"
  namespace  = "traefik"
  repository = "https://charts.helm.sh/stable"
  chart      = "traefik"
  version    = "1.87.2"
  values = [<<EOF
loadBalancerIP: "50.100.200.300"
service:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: "aks-rg"
kubernetes:
  ingressClass: traefik
  ingressEndpoint:
    useDefaultPublishedService: true
dashboard:
  enabled: true
  domain: traefik.mydomain.tld
  ingress:
    annotations:
      kubernetes.io/ingress.class: traefik
metrics:
  serviceMonitor:
    enabled: true
rbac:
  enabled: true
ssl:
  enabled: true
  enforced: true
acme:
  enabled: true
  email: admin@mydomain.tld
  staging: true
  tlsChallenge: true
  entrypoint: https
  ports: "443:443"
  challengeType: tls-alpn-01
  onHostRule: true
  domains:
    enabled: true
    domainsList:
      - main: "mydomain.tld"
      - sans:
        - "traefik.mydomain.tld"
EOF
  ]
}
The DNS records are correctly pointed to the AKS Load Balancer IP.
When I check the traefik logs, I could see that "tls-alpn-01" challenge fails with the following error:
{"level":"error","msg":"Unable to obtain ACME certificate for domains \"mydomain.tld,traefik.mydomain.tld\" : unable to generate a certificatefor the domains [mydomain.tld traefik.mydomain.tld]: acme: Error -\u003e One or more domains had a problem:\n[mydomain.tld] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during connect (likely firewall problem), url: \n[traefik.mydomain.tld] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during connect (likely firewall problem), url: \n","time":"2021-02-26T02:32:05Z"}
Full log is given below:
{"level":"info","msg":"Using TOML configuration file /config/traefik.toml","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"No tls.defaultCertificate given for https: using the first item in tls.certificates as a fallback.","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Traefik version v1.7.24 built on 2020-03-25_04:34:11PM","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/v1.7/basics/#collected-data\n","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Preparing server traefik \u0026{Address::8080 TLS:\u003cnil\u003e Redirect:\u003cnil\u003e Auth:\u003cnil\u003e WhitelistSourceRange:[] WhiteList:\u003cnil\u003e Compress:false ProxyProtocol:\u003cnil\u003e ForwardedHeaders:0xc000851700} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Preparing server http \u0026{Address::80 TLS:\u003cnil\u003e Redirect:0xc0000b1b80 Auth:\u003cnil\u003e WhitelistSourceRange:[] WhiteList:\u003cnil\u003e Compress:true ProxyProtocol:\u003cnil\u003e ForwardedHeaders:0xc0008516a0} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Starting server on :8080","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Preparing server https \u0026{Address::443 TLS:0xc000431e60 Redirect:\u003cnil\u003e Auth:\u003cnil\u003e WhitelistSourceRange:[] WhiteList:\u003cnil\u003e Compress:true ProxyProtocol:\u003cnil\u003e ForwardedHeaders:0xc0008516c0} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Starting provider configuration.ProviderAggregator {}","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Starting server on :80","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Starting server on :443","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Starting provider *kubernetes.Provider {\"Watch\":true,\"Filename\":\"\",\"Constraints\":[],\"Trace\":false,\"TemplateVersion\":0,\"DebugLogGeneratedTemplate\":false,\"Endpoint\":\"\",\"Token\":\"\",\"CertAuthFilePath\":\"\",\"DisablePassHostHeaders\":false,\"EnablePassTLSCert\":false,\"Namespaces\":null,\"LabelSelector\":\"\",\"IngressClass\":\"traefik\",\"IngressEndpoint\":{\"IP\":\"\",\"Hostname\":\"\",\"PublishedService\":\"traefik/traefik\"},\"ThrottleDuration\":0}","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"ingress label selector is: \"\"","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Creating in-cluster Provider client","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Starting provider *acme.Provider {\"Email\":\"admin@mydomain.tld\",\"ACMELogging\":false,\"CAServer\":\"https://acme-staging-v02.api.letsencrypt.org/directory\",\"Storage\":\"/acme/acme.json\",\"EntryPoint\":\"https\",\"KeyType\":\"RSA4096\",\"OnHostRule\":true,\"OnDemand\":false,\"DNSChallenge\":null,\"HTTPChallenge\":null,\"TLSChallenge\":{},\"Domains\":[{\"Main\":\"mydomain.tld\",\"SANs\":[\"traefik.mydomain.tld\"]}],\"Store\":{}}","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Testing certificate renew...","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Server configuration reloaded on :8080","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Server configuration reloaded on :80","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Server configuration reloaded on :443","time":"2021-02-26T02:31:38Z"}
{"level":"info","msg":"Server configuration reloaded on :443","time":"2021-02-26T02:31:39Z"}
{"level":"info","msg":"Server configuration reloaded on :8080","time":"2021-02-26T02:31:39Z"}
{"level":"info","msg":"Server configuration reloaded on :80","time":"2021-02-26T02:31:39Z"}
{"level":"info","msg":"Register...","time":"2021-02-26T02:31:42Z"}
{"level":"info","msg":"Updated status on ingress traefik/traefik-dashboard","time":"2021-02-26T02:31:42Z"}
{"level":"info","msg":"Server configuration reloaded on :443","time":"2021-02-26T02:31:55Z"}
{"level":"info","msg":"Server configuration reloaded on :8080","time":"2021-02-26T02:31:55Z"}
{"level":"info","msg":"Server configuration reloaded on :80","time":"2021-02-26T02:31:55Z"}
{"level":"error","msg":"Unable to obtain ACME certificate for domains \"mydomain.tld,traefik.mydomain.tld\" : unable to generate a certificatefor the domains [mydomain.tld traefik.mydomain.tld]: acme: Error -\u003e One or more domains had a problem:\n[mydomain.tld] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during connect (likely firewall problem), url: \n[traefik.mydomain.tld] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during connect (likely firewall problem), url: \n","time":"2021-02-26T02:32:05Z"}
I have even added an "AllowAll" rule to the AKS load balancer NSG (firewall), but tls-alpn-01 validation still hits the timeout error. The SSL certificate is never generated, and my website is served with the default, expired example.com certificate.
I can confirm that telnet to port 443 of mydomain.tld also works fine.
PS: I don't want to use the "dns-01" challenge, as my DNS provider doesn't offer an API that works with Let's Encrypt. I can't use "http-01" because these are backend servers that do not run any web server.
Any help is much appreciated. I would also like to know how the tls-alpn-01 challenge works.

Lets Encrypt DNS challenge using HTTP

I'm trying to setup a Let's Encrypt certificate on Google Cloud. I recently changed it from http01 to dns01 challenge type so that I could create Cloud DNS zones and the acme challenge TXT record would automatically be added.
Here's my certificate.yaml
apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: san-tls
  namespace: default
spec:
  secretName: san-tls
  issuerRef:
    name: letsencrypt
  commonName: www.evolut.net
  altNames:
    - portal.evolut.net
  dnsNames:
    - www.evolut.net
    - portal.evolut.net
  acme:
    config:
      - dns01:
          provider: clouddns
        domains:
          - www.evolut.net
          - portal.evolut.net
However now I get the following error when I kubectl describe certificate:
Message: DNS names on TLS certificate not up to date: ["portal.evolut.net" "www.evolut.net"]
Reason: DoesNotMatch
Status: False
Type: Ready
More worryingly, when I kubectl describe order I see the following:
Status:
  Challenges:
    Authz URL: https://acme-v02.api.letsencrypt.org/acme/authz/redacted
    Config:
      Http 01:
    Dns Name: portal.evolut.net
    Issuer Ref:
      Kind: Issuer
      Name: letsencrypt
    Key: redacted
    Token: redacted
    Type: http-01
    URL: https://acme-v02.api.letsencrypt.org/acme/challenge/redacted
    Wildcard: false
    Authz URL: https://acme-v02.api.letsencrypt.org/acme/authz/redacted
    Config:
      Http 01:
Notice how the Type is always http-01, although in the certificate the domains are listed under dns01.
This means that the ACME TXT record is never created in Cloud DNS, and of course the domains aren't validated.
This seems to be related to an issue with the use of multiple domains. I suggest using two different namespaces. You can check an example in the following link:
Failed to list *v1alpha1.Order: orders.certmanager.k8s.io is forbidden