I recently ran into an interesting issue with my home Kubernetes environment that runs my blog. As I mentioned in a previous post, I run my blog on k3s and I use cert-manager to manage my SSL certificates provided by Let’s Encrypt. I had temporarily changed my Internet provider and, along with it, my router, and the new router does not appear to support NAT Loopback. The cert-manager documentation acknowledges the issue but doesn’t provide much of a solution. Cert-manager couldn’t renew my blog’s certificate because its self-check kept failing. I managed to solve the issue through a fairly simple CoreDNS change. Let’s take a look.
It took me a little while to figure out what the issue was. I saw the certificate was pending, and looking at the Challenge object, I saw this:
$ kubectl -n blog describe challenges
...
Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for http-01 challenge propagation: failed to perform self check GET request 'http://therubyist.org/.well-known/acme-challenge/...': connect: connection timed out
  State:       pending
I could curl the URL in question from an external machine, so I knew it was up and responding all the way through to the cert-manager ACME responder Ingress and Pod. From the host Linux server, curl also worked. I narrowed down the problem by trying curl from within the blog’s Pod, which had the same timeout problem. After some searching, I managed to locate the cert-manager documentation about issues with external load balancers. While it wasn’t an exact match, it helped me home in on the actual issue. I went back to trying the Linux host, but this time added -v to the curl command. It was connecting to the Linux server’s private IP address, while the Challenge timeout indicated it was trying to connect to the public IP.
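As a rough illustration (the addresses here are made up), the difference was right there in the verbose output:

# On the Linux host: the name resolves to the server's private address
$ curl -v http://therubyist.org/
*   Trying 192.168.1.50:80...
* Connected to therubyist.org (192.168.1.50) port 80

The Challenge, meanwhile, was resolving the same name to the public address and timing out there.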
I tried a curl to the public IP (the same IP the Challenge was trying to self-check) and it had the same timeout problem. This was helpful because now I could test the problem independently of Kubernetes. The reason it worked for the actual name (rather than the IP) is the /etc/hosts entry for the domain on that host.
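For reference, that entry is nothing more exotic than a line like this in the host’s /etc/hosts (the private IP is illustrative):

192.168.1.50    therubyist.org

That’s why name-based requests from the host short-circuited the public IP entirely.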
Potential Solutions
My first thought was to disable the self-check. Surely this isn’t the only reason that someone would want to disable this self-check, but this seems like as good a reason as any. Sadly, despite many people suggesting the option on similar issues, the maintainers of cert-manager really like self-checks. Short of forking cert-manager, I couldn’t find a way to disable self-checks.
Other people with the issue went as far as creating specialized proxies to work around it. I wasn’t keen on the idea of running another service to solve this problem.
Another solution I saw someone use involved modifying the ingress resource to trick Kubernetes into DNATing traffic to the external IP. This seemed like too much of a hack to me, plus it would have required adjusting it whenever the ISP decided to give me a new IP.
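As an illustration of the general idea (the Service name and IP are placeholders for a k3s/Traefik setup like mine): adding the public IP to the ingress controller’s Service as an externalIP makes kube-proxy DNAT traffic addressed to that IP back into the cluster, so the self-check never has to make it out through the router and back:

kubectl -n kube-system patch service traefik \
  --type merge \
  -p '{"spec":{"externalIPs":["203.0.113.10"]}}'

Every ISP-issued address change would mean re-patching, which is exactly the kind of maintenance I wanted to avoid.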
I saw people mentioning CoreDNS and how it could be made to rewrite lookups. This idea seemed like it would work: if I rewrote the DNS queries so they pointed to my Ingress controller’s Service, it should do the trick. Note that messing around in the kube-system namespace tends to be a bad idea. I knew what I was doing and this is a pretty benign change, but you’ve been warned. Here’s what I changed:
kubectl -n kube-system edit configmap/coredns
I edited the Corefile: section to add rewrites for my domains. I found the line that said health and added this below it:
rewrite name therubyist.org traefik.kube-system.svc.cluster.local
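For context, the top of the edited Corefile ends up looking roughly like this (your stock k3s Corefile may differ slightly depending on the version; everything other than the rewrite stays as it was):

.:53 {
    errors
    health
    rewrite name therubyist.org traefik.kube-system.svc.cluster.local
    # one rewrite line per additional domain goes here
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # ... the rest of the stock entries (forward, cache, loop, reload, loadbalance) stay untouched ...
}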
Once I added lines like that for all the domains I needed, I saved the file then killed the CoreDNS Pod.
$ kubectl -n kube-system delete pods/coredns-66c464876b-lflpg
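If you’d rather not look up the Pod name, restarting the Deployment accomplishes the same thing:

kubectl -n kube-system rollout restart deployment coredns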
This triggered a new Pod to be created, and the cluster’s DNS now pointed therubyist.org to the right place. I tested my curl from within the blog’s Pod and it worked! By the time I checked on the Challenge resource again, it had already completed successfully. My new cert was out there and usable:
$ kubectl -n blog get certificates
NAME                  READY   SECRET                AGE
blog-gnagy-info-tls   True    blog-gnagy-info-tls   15m
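For the record, the in-Pod test earlier was nothing fancy, just an exec into the blog’s Pod (substitute whatever your Pod is actually named):

$ kubectl -n blog get pods
$ kubectl -n blog exec -it <blog-pod> -- curl -v http://therubyist.org/

With the rewrite in place, the name resolves to the Traefik Service inside the cluster rather than the public IP, so the request goes through.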
Additional Notes
While this works, whenever k3s is restarted (e.g., the server restarts or I upgrade k3s), this ConfigMap is reset to the original contents (per this GitHub issue). I haven’t yet found the perfect workaround for this, but I’ll update the post when I find something I like.
Comments
Another option is to use DNS validation instead of HTTP for cert-manager.
Absolutely, though I currently have my DNS for my home stuff (including this blog) through GoDaddy… so there’s no easy integration to let cert-manager do its thing for DNS. Unless you know of a way to do that, in which case I’d gladly switch. I’ve considered transferring my domains somewhere else like Route53 to make it easier, but I never get around to it.