As of version 4.5, Domino can provision an on-demand Ray cluster within the same Kubernetes infrastructure that runs Domino. Below are some tips for gathering logs and beginning to troubleshoot issues with on-demand Ray in Domino.
These are the platform components responsible for running on-demand Ray workloads, along with useful commands for checking their status and retrieving logs. When troubleshooting issues with Ray, Domino support will request the status of these components and their log output. Checking these components requires access to the Kubernetes cluster. Values in angle brackets are placeholders; replace them with the actual values for your cluster.
- Ray must be version 1.3 or later. Python must be 3.8 or later. Domino must be version 4.5 or later. Earlier versions are not supported.
- The compute environment for the workspace and the compute environment for the Ray cluster must be compatible, with matching versions of Ray and Python.
#workspace compute environment image:
#ray cluster environment image:
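The version requirements above can be checked with a small script. The following is a minimal sketch, not a Domino-provided tool: `version_ge` and `check_ray_env_versions` are hypothetical helper names, the comparison assumes a `sort` that supports `-V` (for example GNU coreutils), and "matching" is interpreted here as an exact match of the version strings.

```shell
# Return success when version $1 is greater than or equal to version $2.
# Assumes a sort that supports -V (version sort), such as GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: validate the versions reported by the two environments.
# $1/$2: Ray and Python versions in the workspace environment.
# $3/$4: Ray and Python versions in the Ray cluster environment.
check_ray_env_versions() {
  version_ge "$1" 1.3 || { echo "FAIL: Ray $1 is older than 1.3"; return 1; }
  version_ge "$2" 3.8 || { echo "FAIL: Python $2 is older than 3.8"; return 1; }
  [ "$1" = "$3" ] || { echo "FAIL: Ray versions differ ($1 vs $3)"; return 1; }
  [ "$2" = "$4" ] || { echo "FAIL: Python versions differ ($2 vs $4)"; return 1; }
  echo "OK: Ray $1 / Python $2 match in both environments"
}
```

The version strings can be obtained by running `python -c 'import ray, platform; print(ray.__version__, platform.python_version())'` in each environment and passing the results to `check_ray_env_versions`.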
RayCluster Custom Resource Definition:
- The RayCluster CRD (rayclusters.distributed-compute.dominodatalab.com) must be deployed.
- Useful Command:
#check if raycluster crd exists
kubectl get crd rayclusters.distributed-compute.dominodatalab.com
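For scripted checks, the CRD lookup can be wrapped so that it prints a clear result and returns a usable exit code. This is a sketch only; `check_ray_crd` is a hypothetical helper name, and a failure may also mean that the cluster is unreachable or that the current user lacks permission to read CRDs.

```shell
RAY_CRD="rayclusters.distributed-compute.dominodatalab.com"

# Hypothetical helper: report whether the RayCluster CRD is installed.
check_ray_crd() {
  if kubectl get crd "$RAY_CRD" >/dev/null 2>&1; then
    echo "OK: CRD $RAY_CRD is installed"
  else
    echo "MISSING: CRD $RAY_CRD was not found (or the cluster is unreachable)"
    return 1
  fi
}
```

Calling `check_ray_crd` with no arguments is enough, since CRDs are cluster-scoped and need no namespace.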
- An on-demand Ray cluster is spun up within the existing Kubernetes cluster that runs Domino.
- Useful Commands:
#check if ray cluster exists
kubectl -n <domino-compute> get ray
#check if ray pods are running
kubectl -n <domino-compute> get pod -l app.kubernetes.io/name=ray
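When some Ray pods are not running, their events usually explain why (for example, scheduling or image pull problems). The sketch below narrows the pod listing to pods whose phase is not `Running` and describes each one. The function names are hypothetical, and the first argument stands for the `<domino-compute>` namespace used above. Note that a pod in the `Running` phase can still be unready or restarting, so this is a first filter rather than a complete health check.

```shell
# Hypothetical helper: list Ray pods in namespace $1 that are not in the Running phase.
ray_pods_not_running() {
  kubectl -n "$1" get pod \
    -l app.kubernetes.io/name=ray \
    --field-selector='status.phase!=Running' \
    -o name
}

# Describe each non-running Ray pod so that its recent events are visible.
describe_stuck_ray_pods() {
  ray_pods_not_running "$1" | while read -r pod; do
    echo "=== $pod ==="
    kubectl -n "$1" describe "$pod"
  done
}
```

For example, `describe_stuck_ray_pods <domino-compute>` prints nothing when every Ray pod is in the `Running` phase.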
Distributed Compute Operator:
- The Distributed Compute Operator is responsible for managing the lifecycle of all Kubernetes resources of the Ray cluster on behalf of the Domino workload.
- Useful Commands:
#check if DCO is running:
kubectl -n <domino-compute> get pod -l app.kubernetes.io/instance=distributed-compute-operator
#check DCO pod for recent events:
kubectl -n <domino-compute> describe pod -l app.kubernetes.io/instance=distributed-compute-operator
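#check DCO logging (with a label selector, kubectl returns only the last 10 lines per pod unless --tail is set):
kubectl -n <domino-compute> logs --timestamps --tail 5000 -l app.kubernetes.io/instance=distributed-compute-operator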
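To attach the DCO output to a support ticket, the three commands above can be captured in a single file. This is a minimal sketch under stated assumptions: `collect_dco_diagnostics` and the output file name are hypothetical, and `--tail=5000` is an arbitrary limit chosen to match the other log commands in this article.

```shell
# Hypothetical helper: save DCO status, events, and logs into one timestamped file.
# $1: Domino compute namespace; $2: output directory (defaults to the current directory).
collect_dco_diagnostics() {
  dco_selector="app.kubernetes.io/instance=distributed-compute-operator"
  dco_out="${2:-.}/dco-diagnostics-$(date +%Y%m%d-%H%M%S).txt"
  {
    echo "## pods"
    kubectl -n "$1" get pod -l "$dco_selector" -o wide
    echo "## describe"
    kubectl -n "$1" describe pod -l "$dco_selector"
    echo "## logs"
    # With a label selector, kubectl logs defaults to 10 lines, so set --tail explicitly.
    kubectl -n "$1" logs --timestamps --tail=5000 -l "$dco_selector"
  } > "$dco_out" 2>&1
  # Print the path of the file that was written.
  echo "$dco_out"
}
```

Running `collect_dco_diagnostics <domino-compute> /tmp` writes one file under `/tmp` and prints its path.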
Nucleus Dispatcher is responsible for managing the lifecycle of the Ray cluster through the creation and deletion of the RayCluster custom resource object.
Nucleus Frontend is responsible for the Workspace/Job Launch UI and the Ray Dashboard.
- Useful Commands:
#get nucleus dispatcher and frontend pod names and check if they are running
kubectl -n <domino-platform> get pods | grep nucleus
#check nucleus dispatcher logs
kubectl -n <domino-platform> logs <nucleus-dispatcher-pod-name> --timestamps --tail 5000
#check nucleus frontend logs
kubectl logs -n <domino-platform> <nucleus-frontend-pod-name> -c nucleus-frontend --timestamps --tail 5000
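When there are several nucleus pods, collecting the logs one pod at a time is tedious. The loop below is a sketch that saves recent logs from each matching pod to its own file. It assumes that the pod names contain `nucleus-dispatcher` or `nucleus-frontend`, which may not hold in every deployment; `collect_nucleus_logs` is a hypothetical helper name, and `--all-containers` is used here instead of naming a container so that the same command works for both pod types.

```shell
# Hypothetical helper: save recent logs from every nucleus dispatcher and frontend pod.
# $1: Domino platform namespace; $2: output directory (defaults to the current directory).
collect_nucleus_logs() {
  kubectl -n "$1" get pods -o name |
    grep -E 'nucleus-(dispatcher|frontend)' |
    while read -r pod; do
      # $pod looks like "pod/<name>"; use the bare name for the file.
      out="${2:-.}/$(basename "$pod").log"
      kubectl -n "$1" logs "$pod" --all-containers --timestamps --tail=5000 > "$out" 2>&1
      echo "$out"
    done
}
```

For example, `collect_nucleus_logs <domino-platform> /tmp` prints the path of each log file it writes.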
This article is meant to give some insight into which components to check and a head start on gathering the logs that Domino support will request when troubleshooting issues with on-demand Ray in Domino. For further assistance, please open a support ticket and include any findings gained from the above.