Version:
Domino v5.2.0+
Issue:
In the UI, Compute Environment builds never leave the Queued state. A Kubernetes admin will see the following error in the logs of the hephaestus-manager pod:
X509: certificate has expired or is not yet valid
This failure requires intervention from an admin with Kubernetes access. Reach out to Domino support if you are uncertain who your internal administrators are.
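One way to confirm the symptom from the command line is to search the hephaestus-manager logs for the error. The following is a minimal sketch, not an official Domino tool: it assumes the hephaestus-manager deployment lives in the domino-platform namespace (as in the restart command in the Resolution section), and the function name check_hephaestus_x509 is purely illustrative.

```shell
# Hypothetical helper: report whether recent hephaestus-manager logs
# contain the certificate expiry error.
# Assumption: the deployment is in the domino-platform namespace.
check_hephaestus_x509() {
  # The match is case-insensitive because the error prefix may be logged
  # as "x509" or "X509".
  if kubectl logs -n domino-platform deploy/hephaestus-manager --tail=1000 2>/dev/null \
      | grep -i -q 'x509: certificate has expired or is not yet valid'; then
    echo "x509 expiry error found"
  else
    echo "x509 expiry error not found"
    return 1
  fi
}
```

Calling check_hephaestus_x509 prints a one-line result and returns a non-zero status when the error is absent, so it can be used in a script condition.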
Root Cause:
The hephaestus builder is the v3 builder in Domino, released with Domino v5.2. If it cannot renew its certificate, the certificate may expire and builds will fail to start.
We have seen two failure modes for this. In one, the cert-manager pod was OOMKilled. In another, two separate cert-manager pods were running in the deployment, which caused a race condition. There may be other failure modes that have not yet been documented.
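To see whether the certificate has actually expired, one option is to read it out of the webhook secret and print its expiry date. This sketch assumes hephaestus-webhook-tls (the secret deleted in the Resolution section) is a standard TLS secret with a tls.crt key, and that base64 and openssl are available locally. The function name is hypothetical.

```shell
# Hypothetical helper: print the expiry date (notAfter) of the certificate
# stored in the hephaestus webhook secret.
# Assumption: the secret stores its certificate under the tls.crt key.
hephaestus_cert_enddate() {
  kubectl get secret -n domino-platform hephaestus-webhook-tls \
    -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -enddate
}
```

The output is a single notAfter=... line. A date in the past is consistent with the error described above.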
Resolution:
To recover from this failure, first stabilize your cert-manager. If cert-manager has been OOMKilled, edit the cert-manager deployment and increase its memory limit. If you find that two cert-managers are deployed, you may need to reach out to Domino support for assistance (see the note on this below).
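A quick way to check for the OOMKilled case is to list the last termination reason of the cert-manager containers. This is a sketch under assumptions: it presumes the cert-manager pods carry the app.kubernetes.io/instance=cert-manager label (the same label used for deployments in the Notes section), and both function names are illustrative.

```shell
# Hypothetical helper: print "namespace/pod reason" for each cert-manager pod,
# where reason is the last termination reason of its containers (may be empty).
# Assumption: pods are labelled app.kubernetes.io/instance=cert-manager.
cert_manager_last_termination() {
  kubectl get pods -A -l 'app.kubernetes.io/instance=cert-manager' \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Hypothetical helper: succeed if any of those pods was last terminated as OOMKilled.
cert_manager_was_oomkilled() {
  cert_manager_last_termination | grep -q 'OOMKilled'
}
```

If cert_manager_was_oomkilled succeeds, raising the memory limit on the cert-manager deployment as described above is the likely next step.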
Once cert-manager is stable, refresh the Kubernetes secret and restart the hephaestus deployment, buildkit, and nucleus:
kubectl delete secret -n domino-platform hephaestus-webhook-tls
kubectl -n domino-platform rollout restart deploy hephaestus-manager
Wait for hephaestus-manager to finish restarting before proceeding, then scale buildkit to 0 and back to 5:
kubectl scale sts -n domino-compute hephaestus-buildkit --replicas=0
Wait for the hephaestus-buildkit pods to terminate, then scale back to 5:
kubectl scale sts -n domino-compute hephaestus-buildkit --replicas=5
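The waiting between the scale steps above can be scripted rather than watched by hand. The following is a minimal sketch, assuming the StatefulSet's status.replicas field reflects the buildkit pods that still exist. The function name and the POLL_SECONDS and MAX_TRIES variables are hypothetical and not part of Domino.

```shell
# Hypothetical helper: poll until the hephaestus-buildkit StatefulSet reports
# the wanted number of pods, or give up after MAX_TRIES polls.
# Usage: wait_for_buildkit_replicas 0
wait_for_buildkit_replicas() {
  want="$1"
  tries="${MAX_TRIES:-60}"
  while [ "$tries" -gt 0 ]; do
    have=$(kubectl get sts -n domino-compute hephaestus-buildkit \
      -o jsonpath='{.status.replicas}')
    # Treat an empty value as zero pods.
    if [ "${have:-0}" = "$want" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-5}"
  done
  echo "timed out waiting for hephaestus-buildkit to reach $want replicas" >&2
  return 1
}
```

For example, run wait_for_buildkit_replicas 0 after scaling to 0 and before scaling back to 5. For the earlier wait on hephaestus-manager, kubectl rollout status deploy/hephaestus-manager -n domino-platform blocks until the restart completes.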
A final step is to restart nucleus services.
Caution: prior to Domino 5.6.2, restarting the Domino nucleus-dispatcher can kill existing workloads. On those versions, do not restart it while critical workloads are running.
Nucleus services can always be restarted from the Domino UI, Admin>>Advanced>>Restart Services. This will do a rolling restart of all 3 of Domino's nucleus services, including the dispatcher.
If you are already on the command line, you can run the following restarts instead:
kubectl rollout restart deploy nucleus-frontend -n domino-platform
kubectl rollout restart deploy nucleus-dispatcher -n domino-platform #(see caution above)
kubectl rollout restart deploy nucleus-train -n domino-platform
Notes/Information:
When checking whether you have two cert-managers running, do not be confused by a list of cert-manager pods that looks like the following. This is normal.
kubectl get po -A | grep cert-manager
cert-manager-6cfc95b946-lg9ds
cert-manager-cainjector-6cdb78646f-28vqk
cert-manager-webhook-75bfcb95c8-jdpfr
custom-cert-manager-68cdd9477c-tvz2r
What you should look for are multiple cert-manager deployments:
kubectl get deployments -A -l 'app.kubernetes.io/instance=cert-manager'
If this returns more than one cert-manager deployment, this can create a race condition and failures in the build manager. If you are uncertain how to resolve multiple cert-manager deployments, reach out to support for assistance.
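The check above can be reduced to a single number. This sketch rests on an interpretation that should be treated as an assumption: that "more than one cert-manager deployment" means more than one deployment named exactly cert-manager, so companion deployments such as cert-manager-cainjector and cert-manager-webhook are not counted. The function name is illustrative.

```shell
# Hypothetical helper: count deployments named exactly "cert-manager" across
# all namespaces among those labelled as a cert-manager instance.
# Assumption: with -A and --no-headers, column 1 is the namespace and
# column 2 is the deployment name.
count_cert_manager_deployments() {
  kubectl get deployments -A -l 'app.kubernetes.io/instance=cert-manager' \
    --no-headers 2>/dev/null \
    | awk '$2 == "cert-manager" { n++ } END { print n + 0 }'
}
```

A result greater than 1 would match the race-condition scenario described above. If your installation names its deployments differently, adjust the name comparison accordingly.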