Connecting to a MapR cluster from Domino
At a high level, the process is as follows:
- Connect to a MapR Edge node and gather the required binaries and configuration files, then download them to your local machine.
- Upload the gathered files into a Domino project to allow access by the Domino environment builder.
- Create a new Domino environment that uses the uploaded files to enable connections to your cluster.
- Enable YARN integration for the Domino projects that you want to use with the MapR cluster.
Domino supports the following types of connections to a MapR cluster:
Create a directory named hadoop-binaries-configs at /tmp.
Copy hive-site.xml from /opt/mapr/spark/spark-<version>/conf to /tmp/hadoop-binaries-configs/. Be sure to replace the <version> string in the command below with the number that matches the folder name on your edge node.
cp /opt/mapr/spark/spark-<version>/conf/hive-site.xml /tmp/hadoop-binaries-configs/
Copy the ssl_truststore from /opt/mapr/conf to /tmp/hadoop-binaries-configs/
cp /opt/mapr/conf/ssl_truststore /tmp/hadoop-binaries-configs/
Once you've copied the above files into /tmp/hadoop-binaries-configs, zip up the directory for transfer to your local machine.
cd /tmp
tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs
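The copy-and-archive steps above can also be run as a single script on the edge node. The following is a minimal sketch rather than part of the documented procedure: the function name gather_mapr_files and the MAPR_HOME and STAGING_PARENT variables are hypothetical, and the Spark version argument must match the folder name on your edge node.

```shell
#!/bin/bash
# Sketch: collect hive-site.xml and ssl_truststore into a compressed archive.
# gather_mapr_files, MAPR_HOME and STAGING_PARENT are illustrative names,
# not part of MapR or Domino.
gather_mapr_files() {
    local spark_version="$1"                    # for example 2.3.1
    local mapr_home="${MAPR_HOME:-/opt/mapr}"   # MapR install root on the edge node
    local parent="${STAGING_PARENT:-/tmp}"      # where the staging directory is created
    local staging="$parent/hadoop-binaries-configs"

    mkdir -p "$staging" || return 1
    cp "$mapr_home/spark/spark-$spark_version/conf/hive-site.xml" "$staging/" || return 1
    cp "$mapr_home/conf/ssl_truststore" "$staging/" || return 1

    # -C changes into the parent directory so the archive stores relative paths
    tar -zcf "$parent/hadoop-binaries-configs.tar.gz" -C "$parent" hadoop-binaries-configs
}
```

With the defaults, calling gather_mapr_files 2.3.1 on the edge node should leave the archive at /tmp/hadoop-binaries-configs.tar.gz.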
Then use SCP from your local machine to download the zipped archive. After transfer, extract the files to your local filesystem and keep them handy for a future step where they will be uploaded to Domino.
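As a rough illustration of the download and extraction on your local machine, the sketch below shows an scp invocation as a comment and a small helper that extracts the archive and checks that both expected files are present. The helper name extract_mapr_files is hypothetical, and the user and host in the scp line are placeholders to replace with your own.

```shell
#!/bin/bash
# Download step, run from your local machine (placeholders, not real values):
# scp <user>@<edge-node-host>:/tmp/hadoop-binaries-configs.tar.gz .

# extract_mapr_files is an illustrative helper, not a Domino or MapR tool.
extract_mapr_files() {
    local archive="$1"      # path to hadoop-binaries-configs.tar.gz
    local dest="${2:-.}"    # directory to extract into
    local f

    tar -xzf "$archive" -C "$dest" || return 1

    # Confirm the two files gathered on the edge node survived the transfer
    for f in hive-site.xml ssl_truststore; do
        if [ ! -f "$dest/hadoop-binaries-configs/$f" ]; then
            echo "missing after extraction: $f" >&2
            return 1
        fi
    done
    echo "extracted to $dest/hadoop-binaries-configs"
}
```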
On the MapR edge node, run java -version to identify the version of Java running on the cluster.
You must then download a JDK .tar file from the Oracle downloads page that matches that version. The filename will follow a pattern like jdk-8u112-linux-x64.tar.gz, which is the filename used in the Dockerfile example later in this document.
Keep this JDK handy for use in a future step.
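One way to check the Java version and work out which JDK archive to look for is sketched below. It assumes the cluster runs Java 8 on 64-bit Linux and that Oracle's archive follows the jdk-8u112-linux-x64.tar.gz naming used later in this document. The function name jdk_archive_name is hypothetical.

```shell
#!/bin/bash
# jdk_archive_name is an illustrative helper. It reads `java -version` output
# on stdin and prints the Oracle JDK 8 archive name that would match it.
jdk_archive_name() {
    local version
    # Pull the quoted version string, e.g. 1.8.0_112, from the first matching line
    version="$(sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)"
    case "$version" in
        1.8.0_*)
            # 1.8.0_112 becomes update 112 of Java 8
            echo "jdk-8u${version#1.8.0_}-linux-x64.tar.gz"
            ;;
        *)
            echo "unrecognized or non-Java-8 version: ${version:-none}" >&2
            return 1
            ;;
    esac
}
```

Because java -version writes to standard error, a typical call on the edge node would be java -version 2>&1 | jdk_archive_name.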
Use the following procedure to upload the files you retrieved in the previous step to a public Domino project. This will make the files available to the Domino environment builder.
Log in to Domino, then create a new public project.
Open the Files page for the new project, then click to browse for files and select the files you downloaded from the MapR edge node, and the JDK you downloaded from Oracle. Then click Upload.
From the Files page of your project, click New File. Name the file run-client.sh, and in its contents you must construct an invocation of the MapR configure.sh script that is valid for setting up a client to connect to your cluster. A full explanation of how to invoke this script is beyond the scope of this document. Read the full documentation on the script from MapR, and consider the following example.
#!/bin/bash
/opt/mapr/server/configure.sh -N <clustername> -c -secure -C <host1>:7222,<host2>:7222,<host3>:7222 -HS <historyServer>
Once your project contains the files from the MapR edge node, the correct JDK, and a run-client.sh script that wraps the MapR configuration script, click the gear menu next to each of those files, then right click Download and click Copy Link Address. Save these URLs in your notes, as you will need them in the next step.
Once you have recorded the download URL of the binaries and configuration files, you're ready to build a Domino environment for connecting to MapR.
Click Environments from the Domino main menu, then click Create Environment.
Give the environment an informative name, then choose a base environment that includes the version of Python that is installed on the nodes of your MapR cluster. Most Linux distributions ship with Python 2.7 by default, so you will see the Domino Analytics Distribution for Python 2.7 used as the base image in the following examples. Click Create when finished.
After creating the environment, click Edit Definition. Copy the below example into your Dockerfile Instructions, then be sure to edit it wherever necessary with values specific to your deployment and cluster.
In this Dockerfile, wherever you see a hyphenated instruction enclosed in angle brackets, like <paste-your-domino-download-url-here>, be sure to replace it with the corresponding value you recorded in previous steps. You may also need to edit the commands that follow to match the downloaded filenames.
# Base Image: quay.io/domino/base:Ubuntu16_DAD_Py2.7_R3.4-20180727

USER root

# Give the ubuntu user ability to sudo as any user including root in the compute environment
RUN echo "ubuntu ALL=(ALL:ALL) NOPASSWD: ALL" >> /etc/sudoers

# Set up directories
RUN mkdir /tmp/mapr-cluster-downloads && \
    mkdir /usr/jdk64

# Create a mapr user and group
RUN groupadd -g 5000 mapr
RUN useradd -u 5000 -g mapr mapr
RUN usermod -s /bin/bash mapr

# Use the following wget commands to download the four files you added to Domino in the previous section.
# You should have copied down the URLs to download a JDK .tar, the two files from the edge node, and the run-client.sh script you created.
# The example below will use a JDK file named jdk-8u112-linux-x64.tar.gz. If you're using a different version or have a different filename, replace it wherever it occurs.
RUN cd /tmp/mapr-cluster-downloads && \
    wget <paste-your-run-client-dot-sh-download-url-here> -O /tmp/mapr-cluster-downloads/run-client.sh.gz && \
    wget <paste-your-hive-site-dot-xml-download-url-here> -O /tmp/mapr-cluster-downloads/hive-site.xml.gz && \
    wget <paste-your-jdk-tar-download-url-here> -O /tmp/mapr-cluster-downloads/jdk-8u112-linux-x64.tar.gz && \
    wget <paste-your-ssl-truststore-download-url-here> -O /tmp/mapr-cluster-downloads/ssl_truststore.gz && \
    gunzip run-client.sh.gz && \
    gunzip hive-site.xml.gz && \
    gunzip jdk-8u112-linux-x64.tar.gz && \
    gunzip ssl_truststore.gz && \
    cd ~

# Install Java from the JDK
RUN tar xvf /tmp/mapr-cluster-downloads/jdk-8u112-linux-x64.tar -C /usr/jdk64 && \
    ln -s /usr/jdk64/jdk1.8.0_112 /usr/jdk64/default
ENV JAVA_HOME=/usr/jdk64/default
RUN echo "export JAVA_HOME=/usr/jdk64/default" >> /home/ubuntu/.domino-defaults && \
    echo "export PATH=$JAVA_HOME/bin:$PATH" >> /home/ubuntu/.domino-defaults

# Install mapr-client and Spark binaries from the MapR ubuntu repository.
# These examples are for MapR 6.1.0.
# If you are using a different version of MapR, replace these URLs with the correct versions from http://archive.mapr.com/releases/.
RUN echo "deb https://package.mapr.com/releases/v6.1.0/ubuntu binary trusty" >> /etc/apt/sources.list
RUN echo "deb https://package.mapr.com/releases/MEP/MEP-6.0.0/ubuntu binary trusty" >> /etc/apt/sources.list
RUN wget -O - https://package.mapr.com/releases/pub/maprgpg.key | sudo apt-key add -
RUN apt-get update
RUN apt-get -y install mapr-client mapr-spark mapr-hive

# Copy the ssl_truststore file from the /opt/mapr/conf directory on the cluster to the /opt/mapr/conf directory on the client
RUN cp /tmp/mapr-cluster-downloads/ssl_truststore /opt/mapr/conf/

# Make your customized script from the previous section executable
RUN chmod +x /tmp/mapr-cluster-downloads/run-client.sh

# Update SPARK and HADOOP environment variables.
# Make sure the Spark and Hadoop version numbers match what is installed on your cluster.
# The examples below show Spark 2.3.1 and Hadoop 2.7.0.
# If you are using different versions, be sure to edit the file and directory names to match.
# Make sure the py4j file name matches the one on your edge node.
ENV SPARK_HOME=/opt/mapr/spark/spark-2.3.1
RUN echo "export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.7.0" >> /home/ubuntu/.domino-defaults && \
    echo "export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop" >> /home/ubuntu/.domino-defaults && \
    echo "export YARN_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop" >> /home/ubuntu/.domino-defaults && \
    echo "export SPARK_HOME=/opt/mapr/spark/spark-2.3.1" >> /home/ubuntu/.domino-defaults && \
    echo "export SPARK_CONF_DIR=/opt/mapr/spark/spark-2.3.1/conf" >> /home/ubuntu/.domino-defaults && \
    echo "export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip" >> /home/ubuntu/.domino-defaults

# Change spark configuration directory permission as a new spark-defaults.conf file gets created by Domino's spark integration
RUN chmod 777 /opt/mapr/spark/spark-2.3.1/conf

# Add symlinks for Spark binaries
RUN ln -s /opt/mapr/spark/spark-2.3.1/bin/pyspark /usr/bin/pyspark
RUN ln -s /opt/mapr/spark/spark-2.3.1/bin/spark-shell /usr/bin/spark-shell
RUN ln -s /opt/mapr/spark/spark-2.3.1/bin/spark-submit /usr/bin/spark-submit

# Update Java path for R
RUN export LD_LIBRARY_PATH=/usr/jdk64/default/jre/lib/amd64/server && R CMD javareconf

# Install Python and R JDBC packages
RUN pip install jaydebeapi
RUN R --no-save -e 'install.packages(c("RJDBC"))'
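Because the Dockerfile hard-codes the Spark, Hadoop, and py4j versions, it can help to list what is actually installed on the edge node before editing those values. The sketch below assumes the directory layout shown in the Dockerfile (spark/spark-<version>, hadoop/hadoop-<version>, and python/lib/py4j-<version>-src.zip under the MapR install root). The function name list_mapr_versions and the MAPR_HOME variable are hypothetical.

```shell
#!/bin/bash
# list_mapr_versions is an illustrative helper to run on the edge node.
# It prints the Spark and Hadoop directory names and the py4j archive name
# found under the MapR install root (MAPR_HOME is an assumed override).
list_mapr_versions() {
    local mapr_home="${MAPR_HOME:-/opt/mapr}"
    local path

    for path in "$mapr_home"/spark/spark-* "$mapr_home"/hadoop/hadoop-*; do
        # An unmatched glob stays literal, so test before printing
        if [ -d "$path" ]; then
            basename "$path"
        fi
    done

    for path in "$mapr_home"/spark/spark-*/python/lib/py4j-*-src.zip; do
        if [ -f "$path" ]; then
            basename "$path"
        fi
    done
    return 0
}
```

The names it prints are the ones to substitute for spark-2.3.1, hadoop-2.7.0, and py4j-0.10.7-src.zip in the Dockerfile instructions.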
Scroll down to the Pre Run Script field and add the following lines, being sure to match the Spark version in the directory name to the one being set up by the Dockerfile instructions.
# Configure mapr-client with your customized script.
sudo bash /tmp/mapr-cluster-downloads/run-client.sh

# Copy hive-site.xml to the spark configuration directory
# Be sure to match the Spark version in this folder name to match what you set up above.
cp /tmp/mapr-cluster-downloads/hive-site.xml /opt/mapr/spark/spark-2.3.1/conf
Request a long-running MapR ticket from your cluster administrator, and copy its contents to your local machine. The ticket will be formatted as:
Add the contents of that ticket as a Domino environment variable on your Domino user account, with the name USERTICKET.
Add the following lines to the bottom of the Pre Run Script field for the environment you edited previously.
## Write maprticket in environment variable to a file during runtime
echo $USERTICKET > /tmp/maprticket_12574
chown ubuntu:ubuntu /tmp/maprticket_12574
chmod 600 /tmp/maprticket_12574
Note that if you do this, every user who wants to use this environment must set up a USERTICKET environment variable as described in the previous step.
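Since a missing USERTICKET would otherwise produce an empty ticket file without any warning, a pre-run script could guard against it. The following is a minimal sketch of such a guard, not the documented Domino behavior: the function name write_mapr_ticket is hypothetical, and the chown step is only attempted when the script runs as root and an ubuntu user exists.

```shell
#!/bin/bash
# write_mapr_ticket is an illustrative helper for a pre-run script.
# It refuses to write the ticket file when USERTICKET is unset or empty.
write_mapr_ticket() {
    local ticket_file="${1:-/tmp/maprticket_12574}"

    if [ -z "${USERTICKET:-}" ]; then
        echo "USERTICKET is not set; add it as a user environment variable in Domino" >&2
        return 1
    fi

    # Write inside a subshell with a restrictive umask so a newly created
    # file is never readable by other users
    ( umask 077 && printf '%s\n' "$USERTICKET" > "$ticket_file" ) || return 1
    chmod 600 "$ticket_file" || return 1

    # Hand the file to the ubuntu user only when that is possible
    if [ "$(id -u)" -eq 0 ] && id ubuntu >/dev/null 2>&1; then
        chown ubuntu:ubuntu "$ticket_file" || return 1
    fi
}
```

Calling write_mapr_ticket with no argument targets the same /tmp/maprticket_12574 path used in the lines above.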
Click Build when finished editing the Dockerfile instructions. If the build completes successfully, you are ready to try using the environment.
This procedure assumes that an environment with the necessary client software has been created according to the instructions above. Ask your Domino admin for access to such an environment.
- Open the Domino project you want to use with your MapR cluster, then click Settings from the project menu.
- On the Integrations tab, click to select YARN integration from the Apache Spark panel, then click Save. You should not need to edit any of the fields in this section.
- On the Hardware & Environment tab, change the project default environment to the one you built earlier with the binaries and configuration files.
You are now ready to start Runs from this project that interact with your MapR cluster.
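Before starting real work, a quick check from inside a Run can confirm that the environment was set up as described. The sketch below only verifies that the variables written to .domino-defaults by the Dockerfile are exported and that a ticket file exists. It does not contact the cluster. The function name check_mapr_env is hypothetical.

```shell
#!/bin/bash
# check_mapr_env is an illustrative helper to run inside a Domino Run.
# It returns non-zero if any expected variable or the ticket file is missing.
check_mapr_env() {
    local ticket_file="${1:-/tmp/maprticket_12574}"
    local missing=0
    local var

    # These are the variables exported by the Dockerfile instructions above
    for var in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR YARN_CONF_DIR SPARK_HOME SPARK_CONF_DIR; do
        if ! printenv "$var" >/dev/null; then
            echo "not set: $var" >&2
            missing=1
        fi
    done

    if [ ! -f "$ticket_file" ]; then
        echo "ticket file not found: $ticket_file" >&2
        missing=1
    fi

    return "$missing"
}
```

If the check passes, launching pyspark or spark-submit from the Run is a reasonable next test of the actual connection to the cluster.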