Tugdual Grall

Setting Up Spark Dynamic Allocation on MapR

February 5, 2021

Editor’s Note: MapR products and solutions sold prior to the acquisition of such assets by Hewlett Packard Enterprise Company in 2019 may have older product names and model numbers that differ from current solutions. For information about current offerings, which are now part of HPE Ezmeral Data Fabric, please visit https://www.hpe.com/us/en/software/data-fabric.html

Original Post Information:

Author: Tugdual Grall
Published: 2016-11-03
Tags: apache-spark

Apache Spark can use various cluster managers to execute applications (Standalone, YARN, Apache Mesos). When you install Apache Spark on MapR, you can submit an application in Standalone mode or by using YARN.

This blog post focuses on YARN and dynamic allocation, a feature that lets Spark add and remove executors dynamically based on the workload. You can find more information about this feature in the presentation from Databricks on the topic.

Let’s see how to configure Spark and YARN to use dynamic allocation, which is disabled by default.

Prerequisites

  • MapR Data Platform cluster
  • Apache Spark for MapR installed

The examples below are for MapR 5.2 with Apache Spark 1.6.1; adapt the version numbers and paths to your environment.

Enabling Dynamic Allocation in Apache Spark

The first step is to enable dynamic allocation in Spark. To do this, edit the Spark configuration file on each Spark node:

/opt/mapr/spark/spark-1.6.1/conf/spark-defaults.conf

and add the following entries:

# Enable dynamic allocation of executors
spark.dynamicAllocation.enabled = true
# The external shuffle service is required for dynamic allocation
spark.shuffle.service.enabled = true
# Keep at least 5 executors while the application runs
spark.dynamicAllocation.minExecutors = 5
# Do not request a fixed number of executors at startup
spark.executor.instances = 0

You can find additional configuration options in the Apache Spark Documentation.
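For example, in addition to the entries above, you can bound how far an application scales up and how quickly idle executors are released. A minimal sketch with illustrative values (these are optional settings documented by Apache Spark, not MapR-specific recommendations):

# Upper bound on the number of executors an application can request
spark.dynamicAllocation.maxExecutors = 20
# Release an executor after it has been idle for this long
spark.dynamicAllocation.executorIdleTimeout = 60s
# Request new executors once tasks have been queued for this long
spark.dynamicAllocation.schedulerBacklogTimeout = 1s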

Enabling Spark External Shuffle for YARN

Now you need to edit the YARN configuration to register the Spark Shuffle Service. Edit the following file on each YARN node:

/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/yarn-site.xml

and add these properties:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,mapr_direct_shuffle,spark_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
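If your yarn-site.xml already defines yarn.nodemanager.aux-services, append spark_shuffle to the existing value rather than replacing it. To sanity-check the change on a node, you can grep the file (a quick check, assuming the default MapR Hadoop path used above):

$ grep -A 1 "yarn.nodemanager.aux-services" /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/yarn-site.xml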

Adding the Spark Shuffle Service to the YARN Classpath

The Spark Shuffle Service jar must be added to the YARN classpath. The jar ships with the Spark distribution:

/opt/mapr/spark/spark-1.6.1/lib/spark-1.6.1-mapr-1605-yarn-shuffle.jar

To do this, place the jar in the following folder on each YARN node:

/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib

You can either copy the file or create a symlink:

$ ln -s /opt/mapr/spark/spark-1.6.1/lib/spark-1.6.1-mapr-1605-yarn-shuffle.jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib
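To verify that YARN can see the jar on a node, list the folder (a quick check; the exact jar name depends on your Spark and MapR versions):

$ ls -l /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib | grep shuffle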

Restart YARN

Since you have changed the YARN configuration, you must restart your node managers using the following command:

$ maprcli node services -name nodemanager -action restart -nodes [list of nodes]
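To confirm that the NodeManagers are back up, you can list the services running on a given node; for example (a quick check, assuming maprcli is installed on that node, with the hostname as a placeholder):

$ maprcli service list -node <node-hostname>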

Submitting a Spark Job

Your MapR cluster is now ready to use Spark dynamic allocation. This means that when you submit a job, you do not need to specify any resource configuration. For example:

/opt/mapr/spark/spark-1.6.1/bin/spark-submit \
  --class com.mapr.demo.WordCountSorted \
  --master yarn \
  ~/spark-examples-1.0-SNAPSHOT.jar \
  /mapr/my.cluster.com/input/4gb_txt_file.txt \
  /mapr/my.cluster.com/user/mapr/output/
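While the job runs, you can watch executors being added and removed on the Executors tab of the Spark UI, or list the running YARN applications from the command line (assuming the yarn client is configured on your machine):

$ yarn application -list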

Note that you can still specify resources explicitly, but in that case dynamic allocation will not be used for that job. For example:

/opt/mapr/spark/spark-1.6.1/bin/spark-submit \
  --class com.mapr.demo.WordCountSorted \
  --master yarn \
  --num-executors 3 \
  --executor-memory 1G \
  ~/spark-examples-1.0-SNAPSHOT.jar \
  /mapr/my.cluster.com/input/4gb_txt_file.txt \
  /mapr/my.cluster.com/user/mapr/output/

In this blog post, you learned how to set up Spark dynamic allocation on MapR.
