Using Dask with Domino
Recently at Coiled, we've been working with the great folks at Domino and have received questions from Domino users about how to use Dask with Domino (the two make a great combination!). Based on our work with users over the last few months, we generally recommend one of the following options, depending on your needs and specific use case:
1. Use Dask (either with the default threaded/multiprocessing schedulers or with the distributed scheduler) within your Domino workspace, which will use as many cores and as much RAM as are available in your workspace VM (see the first sketch after this list). You might also find some of the JupyterLab extensions for working with Dask to be useful.
2. Use the distributed-compute-operator provided by Domino to provision a Dask cluster in Kubernetes. This is a great way to spin up a Dask cluster on the same Kubernetes cluster where you are already working with Domino workspaces and deployments, and it lets you request additional CPU/RAM beyond the scope of your Domino workspace (see the second sketch after this list).
3. Create a Dask cluster with Coiled on your preferred cloud provider and drive your Dask computations from your existing Domino Workspaces or Jobs. Because Coiled is designed for working with remote Dask clusters, you can easily ask for additional CPU/RAM/GPUs to get more compute power for ETL jobs, exploratory data analysis, or one of the many other use cases familiar to Dask users.
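For option 1, a minimal sketch might look like the following; the file pattern and worker counts are illustrative, so size them to whatever your workspace VM actually has:

# Option 1: Dask inside the workspace VM
import dask.dataframe as dd
# Default scheduler: parallelizes across the VM's cores automatically
df = dd.read_csv('transactions-*.csv')  # hypothetical local files
total = df.groupby('account_id').balance.sum().compute()
# Or run the distributed scheduler locally to get the diagnostic dashboard
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # match your VM
client = Client(cluster)
print('Dashboard:', client.dashboard_link)

For option 2, once the operator has provisioned a Dask cluster in Kubernetes, connecting from a workspace is plain dask.distributed. The service address below is hypothetical; use the scheduler address Domino reports for the cluster it created:

# Option 2: connect to a Dask cluster provisioned by the operator
from dask.distributed import Client
# Hypothetical in-cluster service address; substitute the one Domino reports
client = Client('tcp://dask-scheduler.my-project:8786')
print('Dashboard:', client.dashboard_link)

And for option 3, creating a cluster with Coiled and driving it from your workspace looks like this: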
# Create a Coiled cluster
import coiled
cluster = coiled.Cluster(n_workers=10, name="dask-from-domino")
# Connect Dask to Coiled cluster
from dask.distributed import Client
client = Client(cluster)
print('Dashboard:', client.dashboard_link)
# Use Dask as usual!
import dask.dataframe as dd
df = dd.read_csv('s3://.../data.csv')
df.groupby(df.account_id).balance.sum().compute()  # runs on the cluster
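One usage note: when you're finished, it's worth releasing the cloud resources explicitly with the standard distributed API (Coiled can also shut down idle clusters on its own after a timeout):

# Shut things down to stop paying for idle workers
client.close()
cluster.close()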
Let us know how you're using Dask in Domino and any follow-up questions you might have!