If the nucleus frontend and dispatcher pods are stuck in a CrashLoopBackOff state, and their logs contain entries such as the following:
Commands:
kubectl logs -n <domino-platform-namespace> nucleus-dispatcher-<id> --tail 1000
kubectl logs -n <domino-platform-namespace> nucleus-frontend-<id> -c nucleus-frontend --tail 1000
Logs:
1) Error in custom provider, org.quartz.SchedulerConfigException: Error while initializing the indexes
[See nested exception: com.mongodb.MongoWriteConcernException: waiting for replication timed out]
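To find the exact pod names to substitute for <id> in the log commands above, list the platform pods and filter on nucleus:

kubectl get pods -n <domino-platform-namespace> | grep nucleus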
This points to a replication problem in at least one of the mongodb-replicaset pods. A tell-tale sign is a mongodb-replicaset pod with a high restart count, along with replication errors in its logs, which you can retrieve with the command below:
Commands:
kubectl logs -n <domino-platform-namespace> mongodb-replicaset-<id> -c mongodb-replicaset --tail 1000
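Before reading the logs, a plain pod listing usually makes the failing member obvious from the RESTARTS column:

kubectl get pods -n <domino-platform-namespace> | grep mongodb-replicaset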
The nucleus services depend on MongoDB and therefore will not start while replication is failing in the mongodb-replicaset StatefulSet.
To recover and get the nucleus services started, terminate the failing mongodb-replicaset pod by scaling the MongoDB StatefulSet down to the number of healthy pods (usually 2) and then back up to the original replica count (3 in most deployments). Once MongoDB is healthy again, delete any nucleus pods that are still stuck so they restart cleanly. The full sequence is shown below.
# Confirm the name of the MongoDB StatefulSet
kubectl get statefulsets -n <domino-platform-namespace> | grep mongo
# Scale down to the number of healthy pods (usually 2) to terminate the failing member
kubectl scale statefulsets -n <domino-platform-namespace> mongodb-replicaset --replicas=2
# Wait until the failing pod has been removed
kubectl get pods -n <domino-platform-namespace> | grep mongodb-replicaset
# Scale back up to the original replica count (3 in most deployments)
kubectl scale statefulsets -n <domino-platform-namespace> mongodb-replicaset --replicas=3
# Wait until the new pod is Running and ready
kubectl get pods -n <domino-platform-namespace> | grep mongodb-replicaset
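Optionally, before restarting the nucleus services, you can confirm that all members have rejoined the replica set. This is only a sketch: it assumes the pod is named mongodb-replicaset-0, that the legacy mongo shell is available inside the container (newer images ship mongosh instead), and that rs.status() can be run without authentication:

# Print each replica set member and its state; one PRIMARY and the rest SECONDARY indicates healthy replication
kubectl exec -n <domino-platform-namespace> mongodb-replicaset-0 -c mongodb-replicaset -- \
  mongo --quiet --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr); })'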
# List the nucleus pods and note any still stuck in CrashLoopBackOff
kubectl get pods -n <domino-platform-namespace> | grep nucleus
# Delete each stuck pod so it is recreated now that MongoDB replication is healthy
kubectl delete pod -n <domino-platform-namespace> <nucleus-pod-name>
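The deleted pods are recreated automatically; watch them until they reach Running, for example:

kubectl get pods -n <domino-platform-namespace> -w | grep nucleus

If they fall back into CrashLoopBackOff, re-check the mongodb-replicaset logs as described above.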