Version/Environment (if relevant):
This applies to deploys of Domino 4.6.x, 5.0.x, 5.1.x, 5.2.0, 5.2.1, 5.2.2, and 5.3.0 that are using Kubernetes version 1.21 or higher.
Issue:
Domino suddenly becomes unusable for all users. Workspaces, project pages, account pages, and other areas fail with 500 and 40x errors.
Launching a Workspace may display an error screen, and viewing a Model results in a similar error.
For further confirmation, run kubectl logs against the frontend and vault pods. The frontend logs will contain com.bettercloud.vault.VaultException errors with messages like:
WARN [d.s.l.p.ExceptionLoggingFilter] Unchecked exception thread="application-akka.actor.default-dispatcher-200"
com.bettercloud.vault.VaultException: Vault responded with HTTP status code: 500
and:
! @7onoa7a30 - Internal server error, for (GET) /account ->
thread="application-akka.actor.default-dispatcher-206"
play.api.UnexpectedException: Unexpected exception[VaultException: Vault responded with HTTP status code: 403 Response body: {"errors":["permission denied"]}
Vault logs will contain messages like:
2022-08-10T17:50:18.787Z [ERROR] auth.kubernetes.auth_kubernetes_f2baacdb: login unauthorized due to: lookup failed: service account unauthorized; this could mean it has been deleted or recreated with a new token
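To gather these logs yourself, a minimal sketch is below (the <frontend pod name> and <vault pod name> placeholders are illustrative; list the pods in your platform namespace first and substitute the real names):
kubectl get pods -n <domino platform namespace>
kubectl logs -n <domino platform namespace> <frontend pod name>
kubectl logs -n <domino platform namespace> <vault pod name>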
Root Cause:
Kubernetes 1.21 changed how service account tokens are mounted (https://particule.io/en/blog/vault-1.21/), resulting in component problems after 90 days in EKS and after 365 days in non-EKS clusters.
In addition to the Vault component problems, fluentd (run logging) and newrelic-logging will also fail to communicate with the Kubernetes API.
Note: Domino versions 4.6.x and 5.3.0 are not affected by the Vault problems, but they will have the fluentd (run logging) and newrelic-logging problems.
Resolution:
Temporary Workaround
Temporary relief can be achieved by restarting the Vault pods and, if needed, your fluentd, dmm-plier, and newrelic-logging pods:
kubectl delete pod -n <your namespace> <your pod name>
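For example, to find the relevant pods and restart one of the Vault pods (the pod names shown are illustrative; substitute your own):
kubectl get pods -n <domino platform namespace> | grep -E 'vault|fluentd|newrelic-logging|dmm-plier'
kubectl delete pod -n <domino platform namespace> vault-0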
Permanent Solution
For a permanent fix that ensures the token problems don't recur, download the attached deploy-k8s-121-remediation.yaml and run:
kubectl apply -n <domino platform namespace> -f deploy-k8s-121-remediation.yaml
Note: if you are on version 4.6, download the 4.6-deploy-k8s-121-remediation-novault.yaml file and apply it instead.
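You can confirm the remediation workload is running before moving on (the name below matches the pod shown in the log example further down):
kubectl get pods -n <domino platform namespace> | grep kubernetes-121-remediation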
The pod created by the YAML above does not require any downtime for your users, and the process should take around five minutes overall. On a periodic basis, the pod sets and confirms a specific configuration in the Vault configmap and restarts the fluentd and newrelic-logging pods. You can verify its success via the logs:
$ kubectl logs -n <domino platform namespace> kubernetes-121-remediation-5f4d8cb867-gkrk4
Running loop every 5m0s
Deleting pods after 1440h0m0s
Checking vault config for iss validation fix
Vault config already has disable_iss_validation set; skipping update
Updating Vault images and replicas
[vault] statefulset is already updated; skipping
Checking pods with labels app.kubernetes.io/name=fluentd
[fluentd-g5dqv] found pod
[fluentd-g5dqv] creation timestamp 2022-09-03 05:36:17 +0000 UTC still ok; skipping
Checking pods with labels app.kubernetes.io/name=newrelic-logging
[newrelic-logging-prgvf] found pod
[newrelic-logging-prgvf] creation timestamp 2022-09-03 05:35:47 +0000 UTC still ok; skipping
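If you want to double-check the Vault configuration change manually, you can inspect the Vault configmap directly. The configmap name below is an assumption based on a standard Vault install; adjust it to match your deploy:
kubectl get configmap -n <domino platform namespace> vault-config -o yaml | grep disable_iss_validation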
Note: the remediation object will reduce the number of running Vault replicas to 1. Although running fewer instances is not ideal, this is intentional to mitigate Vault-to-Vault communication issues that may otherwise occur.
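To confirm the scaled-down state, check the statefulset (the name vault matches the [vault] entry in the remediation logs above):
kubectl get statefulset -n <domino platform namespace> vault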
Special Note for Domino Model Monitoring:
If you use Domino Model Monitoring (aka DMM or IMM), this issue will manifest as data ingestions stuck in Pending, with errors in the dmm-plier logs similar to:
2023-08-25 14:42:56,194 - MainThread - aiohttp.server:405 - ERROR - Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 435, in _handle_request
resp = await request_handler(request)
...
kubernetes_asyncio.client.exceptions.ApiException: (401)
Reason: Unauthorized
HTTP response headers: <CIMultiDictProxy('Audit-Id': 'f964c633-a57c-4345-9ca6-963b4c8b0cbf', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 25 Aug 2023 14:42:56 GMT', 'Content-Length': '129')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
The resolution provided above won't permanently fix this for DMM. To address it, you must restart the dmm-plier pod every 90 days, either manually or via a CronJob:
kubectl rollout restart -n domino-platform deployment dmm-plier
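If you prefer to automate the restart, the following is a minimal CronJob sketch rather than a Domino-provided manifest. It assumes the public bitnami/kubectl image and creates its own service account with just enough RBAC to restart the dmm-plier deployment; the resource names and monthly schedule are assumptions, so adjust them for your environment (cron cannot express "every 90 days", and a monthly restart stays well within the window):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dmm-plier-restarter
  namespace: domino-platform
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dmm-plier-restarter
  namespace: domino-platform
rules:
  # "kubectl rollout restart" works by patching the deployment's pod template
  - apiGroups: ["apps"]
    resources: ["deployments"]
    resourceNames: ["dmm-plier"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dmm-plier-restarter
  namespace: domino-platform
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dmm-plier-restarter
subjects:
  - kind: ServiceAccount
    name: dmm-plier-restarter
    namespace: domino-platform
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dmm-plier-restarter
  namespace: domino-platform
spec:
  # 03:00 UTC on the first day of every month
  schedule: "0 3 1 * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dmm-plier-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - rollout
                - restart
                - deployment/dmm-plier
                - --namespace=domino-platform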
If you have questions, please contact Domino Technical Support (for Support: internal reference).