This article is meant for customers who use the AWS Glue Data Catalog as their metastore for distributed computing and would like to continue using it with Domino. AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analytics, and its Data Catalog is AWS's version of the Apache Hive Metastore. Unfortunately, it is not compatible with Spark out of the box, so you have to install a special client, the AWS Glue Catalog Client for Apache Hive Metastore, to make them compatible. To get this client working in Domino, you need to build two new compute environments: one for your workspace (set in Project Settings) and one for the Spark cluster executors (set in the workspace cluster configuration).
The diagram above describes how Domino's custom Glue setup compares to Apache Spark's official releases. At the time of writing, the most recent version of Apache Spark is v2.4.6. This release ships with JARs for the latest Apache Hadoop v2.7.x, as well as an Apache Spark fork of Apache Hive based on v1.2.1. In addition, hadoop-aws relies on aws-java-sdk, typically v1.7.x.
The Glue Catalog client provided by AWS, however, introduces two constraints:
- It relies on applying a patch to Apache Hive to support custom metastore clients.
- It requires a much more recent version of aws-java-sdk (v1.11.267+).
The first constraint means Apache Hive must be compiled with the patch applied, and therefore a custom Spark distribution must be built for use on Domino. The second constraint requires Apache Hadoop v2.9.x+, which depends on AWS Java SDK 1.11.x+.
Below is a Dockerfile (adapted from here) that builds this custom distribution of Apache Spark. When the build finishes, copy the resulting artifact (/opt/spark/spark.tgz) out of the image. The artifact is currently hosted in a public S3 bucket, so you do not have to build it yourself.
FROM python:3.6-slim-buster

# ADD REPO FOR JDK
RUN echo "deb http://ftp.us.debian.org/debian sid main" >> /etc/apt/sources.list \
    && apt-get update \
    && mkdir -p /usr/share/man/man1

# INSTALL PACKAGES
RUN apt-get install -y git wget openjdk-8-jdk

# INSTALL MAVEN
ENV MAVEN_VERSION=3.6.3
RUN cd /opt \
    && wget https://downloads.apache.org/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz \
    && tar zxvf /opt/apache-maven-${MAVEN_VERSION}-bin.tar.gz \
    && rm apache-maven-${MAVEN_VERSION}-bin.tar.gz
ENV PATH=/opt/apache-maven-$MAVEN_VERSION/bin:$PATH

ENV SPARK_VERSION=2.4.6
ENV HADOOP_VERSION=2.9.2
ENV HIVE_VERSION=1.2.1
ENV AWS_SDK_VERSION=1.11.267

# BUILD HIVE FOR HIVE v1
RUN git clone https://github.com/apache/hive.git /opt/hive
WORKDIR /opt/hive
RUN git checkout tags/release-$HIVE_VERSION -b rel-$HIVE_VERSION

# Apply patch
RUN wget https://issues.apache.org/jira/secure/attachment/12958417/HIVE-12679.branch-1.2.patch
RUN patch -p0 <HIVE-12679.branch-1.2.patch

# Install fails to fetch this JAR.
RUN wget https://repository.jboss.org/maven2/javax/jms/jms/1.1/jms-1.1.jar
RUN mvn install:install-file -DgroupId=javax.jms -DartifactId=jms -Dversion=1.1 -Dpackaging=jar -Dfile=jms-1.1.jar

# Build Hive
RUN mvn clean install -DskipTests -Phadoop-2

# Related to this issue https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/pull/14
RUN mkdir -p ~/.m2/repository/org/spark-project
RUN cp -r ~/.m2/repository/org/apache/hive ~/.m2/repository/org/spark-project

# BUILD AWS GLUE DATA CATALOG CLIENT
RUN git clone https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore.git /opt/glue
WORKDIR /opt/glue
RUN sed -i '/<packaging>pom<\/packaging>/a <dependencies><dependency><groupId>org.apache.hadoop<\/groupId><artifactId>hadoop-common<\/artifactId><version>${hadoop.version}<\/version><scope>provided<\/scope><\/dependency><\/dependencies>' shims/pom.xml
RUN mvn clean package -DskipTests -pl -aws-glue-datacatalog-hive2-client

# BUILD SPARK
RUN git clone https://github.com/apache/spark.git /opt/spark
WORKDIR /opt/spark
RUN git checkout tags/v$SPARK_VERSION -b v$SPARK_VERSION
RUN ./dev/make-distribution.sh --name my-custom-spark --pip -Phadoop-${HADOOP_VERSION%.*} -Phive -Dhadoop.version=$HADOOP_VERSION -Dhive.version=$HIVE_VERSION

# ADD MISSING & BUILT JARS TO SPARK CLASSPATHS + CONFIG
WORKDIR /opt/spark/dist

# Copy missing deps
RUN mvn dependency:get -Dartifact=asm:asm:3.2
RUN mvn dependency:get -Dartifact=commons-codec:commons-codec:1.9

# Copy Glue Client JARs
RUN find /opt/glue -name "*.jar" -exec cp {} jars \;

# Copy AWS JARs
RUN echo :quit | ./bin/spark-shell --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:$HADOOP_VERSION,com.amazonaws:aws-java-sdk:$AWS_SDK_VERSION
RUN cp /root/.ivy2/jars/*.jar jars

# CREATE ARTIFACT
RUN mv /opt/spark/dist /opt/spark/spark
WORKDIR /opt/spark
RUN tar -cvzf spark.tgz spark
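If you do choose to build the distribution yourself rather than use the prebuilt copy in S3, a minimal sketch of building the image and copying the artifact out of it could look like the following; the image tag custom-spark-glue is a placeholder, not something defined in this article.

# Build the image from the Dockerfile above, then copy /opt/spark/spark.tgz out of it.
# The tag "custom-spark-glue" is a placeholder.
docker build -t custom-spark-glue .
id=$(docker create custom-spark-glue)
docker cp "$id":/opt/spark/spark.tgz ./spark.tgz
docker rm "$id"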
You don't need to build this image yourself unless you want to host the artifact somewhere custom; otherwise, you can pull the prebuilt artifact from S3 using the code below. To create the environments, use these instructions to choose the appropriate base image and add the code to the Dockerfile and/or Pre-Run Script. Note: these environments use the Jupyter Spylon kernel rather than PySpark. If you want to use PySpark, you will need to install it in the Dockerfile for both environments (see the sketch below).
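As a hedged sketch only: PySpark is not installed by the environments below, but a line like the following could be appended to both Dockerfiles if you prefer it over Spylon. The pinned version is an assumption chosen to match the SPARK_VERSION used in the custom build above.

# Optional (not part of the original environments): install PySpark in BOTH the
# workspace and executor Dockerfiles. The version pin matches SPARK_VERSION above.
RUN pip install pyspark==2.4.6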
Workspace Environment
Base Image: Domino Analytics Distribution (or base image of your choosing)
Dockerfile:
ENV SPARK_HOME=/opt/domino/spark
ENV PATH "$PATH:$SPARK_HOME/bin"
RUN mkdir /opt/domino
RUN wget -q https://domino-spark.s3.us-east-2.amazonaws.com/spark.tar && \
    tar -xf spark.tar && \
    rm spark.tar && \
    mv spark /opt/domino/spark && \
    chmod -R 777 /opt/domino/spark/conf && \
    rm /opt/domino/spark/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar
RUN pip install spylon-kernel
RUN python -m spylon_kernel install
Pre-Run Script: Be sure to set $AWS_REGION to the AWS region you operate in.
export PATH="$PATH:$SPARK_HOME/bin"

cat >> $SPARK_HOME/conf/spark-defaults.conf << EOF
spark.hadoop.aws.region $AWS_REGION
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.experimental.fadvise random
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
spark.sql.catalogImplementation hive
spark.sql.hive.convertMetastoreParquet false
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
EOF

cat >> /opt/domino/spark/conf/hive-site.xml <<EOF
<configuration>
  <property>
    <name>hive.metastore.client.factory.class</name>
    <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
  </property>
</configuration>
EOF
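As a quick sanity check (not part of the original article), you can confirm from a workspace terminal that the environment reaches the Glue Data Catalog before attaching a cluster. This is a sketch that assumes the Pre-Run Script above has already run and that the workspace's AWS credentials or role have Glue permissions.

# Metastore-only check run locally; you should see your Glue databases listed.
# Assumes AWS credentials/role with Glue access are available in the workspace.
echo 'spark.sql("SHOW DATABASES").show()' | $SPARK_HOME/bin/spark-shell --master "local[*]"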
Executor Environment
Base Image: Bitnami Spark Image
Dockerfile:
USER root
RUN apt-get update && apt-get install -y wget && rm -r /var/lib/apt/lists /var/cache/apt/archives
WORKDIR /opt/bitnami

# Replace Spark
RUN rm -rf spark
RUN wget -q https://domino-spark.s3.us-east-2.amazonaws.com/spark.tar && \
    tar -xf spark.tar && \
    rm spark.tar && \
    chmod -R 777 spark/conf && \
    rm /opt/bitnami/spark/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar

# Rerun Bitnami post-install script
WORKDIR /
RUN /opt/bitnami/scripts/spark/postunpack.sh
WORKDIR /opt/bitnami/spark
ENV PATH="$PATH:$SPARK_HOME/bin"
USER 1001
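Once both environments are built and a workspace is started with a Spark cluster attached, an end-to-end check that actually scans data through the executors might look like the sketch below. Here my_glue_db.my_table is a placeholder for one of your Glue tables, and the command assumes Domino has already pointed spark.master at the attached cluster (otherwise pass --master explicitly).

# Hypothetical end-to-end check: scans a Glue-backed table through the cluster executors.
# Replace my_glue_db.my_table with one of your own Glue tables.
echo 'spark.sql("SELECT COUNT(*) FROM my_glue_db.my_table").show()' | $SPARK_HOME/bin/spark-shell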
Known Issue:
If you run into an issue accessing data tables in Glue and encounter this error message...
java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119)
This typically happens when Glue table locations use s3:// URIs, which stock Hadoop routes to the legacy jets3t-backed S3FileSystem. To route them through S3A instead, replace this line in the Pre-Run Script of your Workspace environment
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
with this line
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem