Hadoop and Spark overview
Overview
Apache Hadoop is a collection of open-source cluster computing tools that supports popular applications for data science at scale, such as Spark.
You can interact with Hadoop from your Domino executors by configuring your Domino environment with the necessary software dependencies and credentials. Domino supports most providers of Hadoop solutions, including MapR, Cloudera, and Amazon EMR. Once a Domino environment is set up to connect to your cluster, Domino projects can use the environment to work with Hadoop applications.
For a thorough video introduction to using Spark in Domino, watch the webinar recording:
Using a Hadoop-enabled environment in your Domino project
If your Domino administrators have already created an environment for connecting to a Hadoop cluster, you only need to configure your Domino project to use that environment. Read the subsection of the setup instructions that matches your cluster provider:
- Configuring a Domino project for use with a Cloudera CDH5 cluster
- Configuring a Domino project for use with an Amazon EMR cluster
- Configuring a Domino project for use with a MapR cluster
- Configuring a Domino project for use with a Hortonworks cluster
Once your project is set up to use the environment, you can execute code in your Domino Runs that connects to the cluster for Spark, HDFS, or Hive functionality.
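As a rough illustration of what such code might look like, the following sketch builds a Spark session that targets YARN from PySpark. It assumes that the environment already provides PySpark and the cluster's client configuration, and that YARN is the cluster manager. The helper names `yarn_settings` and `build_session`, the application name, and the memory value are hypothetical and not part of Domino.

```python
def yarn_settings(app_name, executor_memory="2g"):
    """Return Spark properties for a session that runs on YARN.

    The property names are standard Spark configuration keys; the
    values are placeholders to adjust for your own cluster.
    """
    return {
        "spark.master": "yarn",
        "spark.app.name": app_name,
        "spark.executor.memory": executor_memory,
    }


def build_session(settings):
    """Create (or reuse) a SparkSession from a dict of Spark properties."""
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a Run, calling `build_session(yarn_settings("my-app"))` would return a session object, and a small job such as `spark.range(1000).count()` is one way to confirm that the cluster accepts work. Whether the connection succeeds depends on how the environment was configured.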
Setting up Domino to connect to a new Hadoop cluster
To connect to your existing Hadoop cluster from Domino, you must create a Domino environment with the necessary dependencies installed. Some of these dependencies, including binaries and configuration files, will come directly from the cluster itself. Others will be external software dependencies like Java and Spark, and you will need to match the version you install in the environment to the version running on the cluster.
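One way to keep track of version matching is to record the versions reported by the cluster and the versions planned for the environment, then compare them. The sketch below is a minimal example of that idea. It assumes that agreement on major and minor version is the level of matching you need, which may be too loose or too strict for your provider. The function names and the version numbers in the usage note are hypothetical.

```python
import re


def major_minor(version):
    """Extract (major, minor) from a string such as '2.4.8' or '1.8.0_292'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(cluster, environment):
    """Return the components whose versions differ or are missing.

    Both arguments map a component name to a version string. A component
    counts as mismatched if the environment lacks it or if its
    major.minor version differs from the cluster's.
    """
    mismatches = []
    for name, cluster_version in cluster.items():
        env_version = environment.get(name)
        if env_version is None:
            mismatches.append(name)
        elif major_minor(env_version) != major_minor(cluster_version):
            mismatches.append(name)
    return mismatches
```

For example, `find_mismatches({"spark": "2.4.0"}, {"spark": "2.3.1"})` reports `spark` as mismatched, while the same call with an environment version of `"2.4.8"` reports nothing.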
The basic steps for setting up an environment to connect to your cluster are:
- Gather binaries and configuration files from your cluster
- Gather dependencies from external sources, like Java JDKs and Spark binaries
- Upload all dependencies to a Domino project, to make them accessible to the Domino environment builder
- Author a new Domino environment that pulls from the Domino project, then installs and configures all required dependencies
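Before uploading the gathered files to a Domino project, it can help to confirm that nothing is missing. The sketch below checks a local directory for a set of expected files. The file names listed are common Hadoop client configuration files and are an assumption here; the provider-specific guides define the actual set for each cluster type. The function name `missing_files` is hypothetical.

```python
from pathlib import Path

# Assumed list of client configuration files; adjust to match the
# provider-specific setup guide for your cluster.
EXPECTED_FILES = ("core-site.xml", "hdfs-site.xml", "yarn-site.xml")


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names not present directly under directory."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from `missing_files("path/to/gathered/files")` suggests that the directory is ready to upload. The path shown is a placeholder.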
Domino admins setting up a Domino environment to connect to a new cluster should read the full setup guide for their provider:
- Connecting to a Cloudera CDH5 cluster from Domino
- Connecting to an Amazon EMR cluster from Domino
- Connecting to a MapR cluster from Domino
- Connecting to a Hortonworks cluster from Domino
Additional capabilities
Domino also supports running Spark on a Domino executor in local mode, querying Hive tables with JDBC, and authenticating to clusters with Kerberos. See the following guides for more information.