Issue:
Workspaces with Spark clusters attached no longer launch; they do not progress past the Assigning or Pending state.
Logs indicate a problem with the distributed-compute-operator pod:
2022-04-22 08:13:53 : Failure in processing event on attempt 1 out of 1
domino.server.dispatcher.infrastructure.KubernetesConfigurationApplicationFailed: Applying Kubernetes configuration failed with status code 1:
Error from server (InternalError): error when creating "/opt/domino-nucleus/tmp/execution-configurations/spark-6262c1ea6711b8302f69a719.yaml":
Internal error occurred: failed calling webhook "msparkcluster.kb.io":
Post "https://distributed-compute-operator-webhook-server.domino-compute.svc:443/mutate-distributed-compute-dominodatalab-com-v1alpha1-sparkcluster?timeout=10s":
no endpoints available for service "distributed-compute-operator-webhook-server"
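You can confirm that the webhook service has no backing endpoints (a sketch; the service and namespace names are taken from the error above):

kubectl -n domino-compute get endpoints distributed-compute-operator-webhook-server

If the ENDPOINTS column is empty, the operator pod behind the service is not ready, which matches the failure mode described below.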
Root Cause:
The general symptom can occur for various reasons (including node capacity or autoscaling issues), but if you see the error above, the memory limit on the distributed-compute-operator pod is likely set too low.
If "kubectl get po -A | grep dist" reveals 0/1 containers in the pod running and a high number for Restart Count, like below,
domino-compute distributed-compute-operator-c7d6b6f8f-sg554 0/1 Running 3909 21
then describe the pod to check its memory limit and confirm that the container's Last State is Terminated. If the limit is still at the default of 100Mi, it will need to be increased for clusters that see even moderate use.
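For example (a sketch; the pod name suffix will differ in your cluster, and this assumes the operator's single container is "manager", as referenced in the patch below):

# Show Limits and Last State for the operator pod
kubectl -n domino-compute describe pod distributed-compute-operator-c7d6b6f8f-sg554

# Print just the configured resource limits for the manager container
kubectl -n domino-compute get pod distributed-compute-operator-c7d6b6f8f-sg554 \
  -o jsonpath='{.spec.containers[?(@.name=="manager")].resources.limits}'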
Resolution:
You can patch the Deployment to raise the limits to 500Mi:
kubectl patch deployment -n domino-compute distributed-compute-operator -p \
'{"spec":{"template":{"spec":{"containers":[{"name":"manager", "resources":{"limits":{"cpu":"150m","memory":"500Mi"}}}]}}}}'
Then delete the existing distributed-compute-operator-nnnn pod so that a replacement pod picks up the new memory limits.
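For example (a sketch; substitute the actual pod name from kubectl get pods):

kubectl -n domino-compute delete pod distributed-compute-operator-c7d6b6f8f-sg554
kubectl -n domino-compute get pods | grep distributed-compute-operator

After a short wait, the replacement pod should report 1/1 Running with a restart count of 0, and the webhook service should have endpoints again, after which Spark workspaces should launch normally.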
Notes/Information:
This type of problem has been reported in late 4.x releases through the current version as of this post, which is 5.2.1.