Version/Environment (if relevant):
This applies to deploys of Domino 4.6.x, 5.0.x, 5.1.x, 5.2.0, 5.2.1, 5.2.2, and 5.3.0 that are using Kubernetes version 1.21 or higher.
Issue:
Domino suddenly becomes unusable for all users. Workspaces, project pages, account pages, etc. largely fail with 500 and 40x errors.
Workspace launches may display an error page, and viewing a Model may fail with an error as well.
For further confirmation, run kubectl logs on the frontend and vault pods. The frontend logs will contain com.bettercloud.vault.VaultException entries with messages like:
WARN [d.s.l.p.ExceptionLoggingFilter] Unchecked exception thread="application-akka.actor.default-dispatcher-200" com.bettercloud.vault.VaultException: Vault responded with HTTP status code: 500
and
! @7onoa7a30 - Internal server error, for (GET) /account -> thread="application-akka.actor.default-dispatcher-206" play.api.UnexpectedException: Unexpected exception[VaultException: Vault responded with HTTP status code: 403 Response body: {"errors":["permission denied"]}
Vault logs will contain messages like:
2022-08-10T17:50:18.787Z [ERROR] auth.kubernetes.auth_kubernetes_f2baacdb: login unauthorized due to: lookup failed: service account unauthorized; this could mean it has been deleted or recreated with a new token
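For example, you can pull and filter those logs with commands like the following (the pod names in angle brackets are placeholders; replace them with the actual frontend and Vault pod names from kubectl get pods):
kubectl get pods -n <domino platform namespace>
kubectl logs -n <domino platform namespace> <frontend pod name> | grep VaultException
kubectl logs -n <domino platform namespace> <vault pod name> | grep "login unauthorized"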
Root Cause:
Kubernetes 1.21 changed how service account tokens are mounted (https://particule.io/en/blog/vault-1.21/), resulting in component problems after 90 days on EKS and 365 days on non-EKS clusters.
In addition to the Vault problems, fluentd (run logging) and newrelic-logging will also fail to communicate with the Kubernetes API.
Note: Domino versions 4.6.x and 5.3.0 do not have the impactful Vault problems, but they do have the fluentd (run logging) and newrelic-logging problems.
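To confirm whether your cluster is on an affected Kubernetes version, check the version reported for the nodes:
kubectl get nodes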
Resolution:
- Temporary relief can be achieved by restarting the Vault pods and potentially your fluentd, dmm-plier, and newrelic-logging pods:
kubectl delete pod -n <your namespace> <your pod name>
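As a sketch, assuming the pods live in the Domino platform namespace and using the fluentd and newrelic-logging label selectors shown in the verification logs further below (the Vault label here is an assumption; adjust the selectors to match your deploy), the restarts could look like:
kubectl delete pod -n <domino platform namespace> -l app.kubernetes.io/name=vault
kubectl delete pod -n <domino platform namespace> -l app.kubernetes.io/name=fluentd
kubectl delete pod -n <domino platform namespace> -l app.kubernetes.io/name=newrelic-logging
kubectl rollout restart -n <domino platform namespace> deployment dmm-plier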
- For permanent relief, to ensure the token problems don't recur, download the attached deploy-k8s-121-remediation.yaml and run:
kubectl apply -n <domino platform namespace> -f deploy-k8s-121-remediation.yaml
* If you are on version 4.6.x, download the 4.6-deploy-k8s-121-remediation-novault.yaml file and apply it instead.
** If you use Domino Model Monitoring (aka DMM or IMM), the remediation object does not yet cover DMM; for permanent relief you must restart the dmm-plier pod manually (or via a scheduled job) every 90 days: kubectl rollout restart -n domino-platform deployment dmm-plier
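As a minimal sketch for scheduling that restart, assuming an admin host with kubectl configured against the cluster (the schedule and the use of a host crontab are assumptions, not part of the remediation object), a crontab entry on the first day of every second month keeps the interval under 90 days:
# runs at 03:00 on day 1 of every 2nd month; assumes kubectl and a kubeconfig are available to the cron user
0 3 1 */2 * kubectl rollout restart -n domino-platform deployment dmm-plier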
The remediation pod created above doesn't require any downtime for your users, and applying it should take around five minutes overall. On a periodic basis it sets and confirms the disable_iss_validation setting in the Vault configmap and restarts the fluentd and newrelic-logging pods. You can verify its success via its logs:
$ kubectl logs -n <domino platform namespace> kubernetes-121-remediation-5f4d8cb867-gkrk4
Running loop every 5m0s
Deleting pods after 1440h0m0s
Checking vault config for iss validation fix
Vault config already has disable_iss_validation set; skipping update
Updating Vault images and replicas
[vault] statefulset is already updated; skipping
Checking pods with labels app.kubernetes.io/name=fluentd
[fluentd-g5dqv] found pod
[fluentd-g5dqv] creation timestamp 2022-09-03 05:36:17 +0000 UTC still ok; skipping
Checking pods with labels app.kubernetes.io/name=newrelic-logging
[newrelic-logging-prgvf] found pod
[newrelic-logging-prgvf] creation timestamp 2022-09-03 05:35:47 +0000 UTC still ok; skipping
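If you want to double-check the Vault configuration change directly, you can inspect the configmap (the exact configmap name varies by deploy, so list the configmaps first; <vault configmap name> is a placeholder):
kubectl get configmaps -n <domino platform namespace>
kubectl get configmap -n <domino platform namespace> <vault configmap name> -o yaml | grep disable_iss_validation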
Note 2: the remediation object reduces the number of running Vault replicas to 1. Although the decreased number of instances is not ideal, this is intentional to mitigate Vault-to-Vault communication issues that may occur.
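To confirm the replica count after the change, you can inspect the Vault statefulset (the statefulset name vault matches the remediation logs above):
kubectl get statefulset -n <domino platform namespace> vault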
If you have questions, please contact Domino Technical Support.