Workspaces with Spark clusters attached are no longer launching; they do not progress past the Assigning or Pending state.
Logs indicate a problem with the distributed-compute-operator pod:
2022-04-22 08:13:53 : Failure in processing event on attempt 1 out of 1 domino.server.dispatcher.infrastructure.KubernetesConfigurationApplicationFailed: Applying Kubernetes configuration failed with status code 1:
Error from server (InternalError): error when creating "/opt/domino-nucleus/tmp/execution-configurations/spark-6262c1ea6711b8302f69a719.yaml":
Internal error occurred: failed calling webhook "msparkcluster.kb.io":
no endpoints available for service "distributed-compute-operator-webhook-server"
This general symptom can occur for various reasons (including node capacity or autoscaling issues), but if you see the above error, the memory limit defined for a specific component may be too low.
If "kubectl get po -A | grep dist" shows the pod with 0/1 containers ready and a high restart count, like below,
domino-compute distributed-compute-operator-c7d6b6f8f-sg554 0/1 Running 3909 21
then describe this pod to check its memory limits and confirm that Last State is Terminated. If the limit is still set to the default of 100Mi, it will need to be increased for anything beyond light cluster usage.
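For example, describing the pod from the output above (substitute your actual pod name) shows both the limit and the termination details:

kubectl describe pod -n domino-compute distributed-compute-operator-c7d6b6f8f-sg554

In the output, check the container's Limits (memory: 100Mi by default) and whether Last State shows Terminated, typically with Reason: OOMKilled when the memory limit is the cause.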
You can patch the Deployment object to increase the limits to 500Mi:
kubectl patch deployment -n domino-compute distributed-compute-operator -p \
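A complete patch command along these lines should work; this is a sketch that assumes the container in the Deployment is named manager, so verify the actual container name in your environment first:

kubectl patch deployment -n domino-compute distributed-compute-operator -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"500Mi"}}}]}}}}'

You can confirm the container name with:

kubectl get deployment -n domino-compute distributed-compute-operator -o jsonpath='{.spec.template.spec.containers[*].name}'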
Then delete the distributed-compute-operator-nnnn pod so it picks up the new memory limits.
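For example, using the pod name from the earlier output (substitute your own):

kubectl delete pod -n domino-compute distributed-compute-operator-c7d6b6f8f-sg554

The Deployment will recreate the pod with the updated limits.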
This type of problem has been reported from late 4.x releases through the current version as of this post, 5.2.1.