Workspaces with Spark clusters attached no longer launch; they do not progress past the Assigning or Pending state.
Logs indicate a problem with the distributed-compute-operator pod:
2022-04-22 08:13:53 : Failure in processing event on attempt 1 out of 1 domino.server.dispatcher.infrastructure.KubernetesConfigurationApplicationFailed: Applying Kubernetes configuration failed with status code 1:
Error from server (InternalError): error when creating "/opt/domino-nucleus/tmp/execution-configurations/spark-6262c1ea6711b8302f69a719.yaml":
Internal error occurred: failed calling webhook "msparkcluster.kb.io":
no endpoints available for service "distributed-compute-operator-webhook-server"
This general symptom can occur for various reasons (including node capacity or autoscaling issues), but if you see the error above, the memory limit defined for the operator component may be too low.
If "kubectl get po -A | grep dist" shows 0/1 containers running in the pod along with a very high restart count, like below,
domino-compute distributed-compute-operator-c7d6b6f8f-sg554 0/1 Running 3909 21
then describe the pod to check its memory limits and confirm that Last State is Terminated. If the limit is still set to the default 100Mi, it will need to be increased for clusters under even moderate use.
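As a sketch of what to look for (the pod name below is taken from the example output above and will differ in your cluster):

```shell
# Describe the operator pod (substitute your actual pod name):
kubectl describe pod -n domino-compute distributed-compute-operator-c7d6b6f8f-sg554

# In the output, check the container's memory limit and last state, e.g.:
#   Limits:
#     memory:  100Mi
#   Last State:  Terminated
#     Reason:    OOMKilled
```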
You can patch the Deployment object to increase the limit to 500Mi:
kubectl patch deployment -n domino-compute distributed-compute-operator -p \
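The patch payload is truncated above; a sketch of the full command, assuming the container inside the Deployment is also named distributed-compute-operator (verify the name first, since it may differ), might look like:

```shell
# Confirm the container name before patching (assumption: it matches the Deployment name):
kubectl get deployment -n domino-compute distributed-compute-operator \
  -o jsonpath='{.spec.template.spec.containers[*].name}'

# Strategic merge patch raising the memory limit to 500Mi:
kubectl patch deployment -n domino-compute distributed-compute-operator -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"distributed-compute-operator","resources":{"limits":{"memory":"500Mi"}}}]}}}}'
```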
Then delete the distributed-compute-operator-nnnn pod so that its replacement picks up the new memory limits.
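For example, using the pod name from the earlier output (yours will differ):

```shell
# Delete the current pod; the Deployment controller recreates it with the new limits:
kubectl delete pod -n domino-compute distributed-compute-operator-c7d6b6f8f-sg554

# Verify the replacement pod reaches 1/1 Running:
kubectl get po -n domino-compute | grep distributed-compute-operator
```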
This type of problem has been reported from late 4.x releases through the current version as of this post, 5.2.1.