Carol McDonald

How to Get Started with Spark Streaming and MapR Event Store Using the Kafka API

February 19, 2021

Editor’s Note: MapR products and solutions sold prior to the acquisition of such assets by Hewlett Packard Enterprise Company in 2019 may have older product names and model numbers that differ from current solutions. For information about current offerings, which are now part of HPE Ezmeral Data Fabric, please visit https://www.hpe.com/us/en/software/data-fabric.html

Original Post Information:

"authorDisplayName": "Carol McDonald",
"publish": "2017-05-15T07:00:00.000Z",
"tags": "apache-spark"

The telecommunications industry is on the verge of a major transformation through the use of advanced analytics and big data technologies like the MapR Data Platform.

This post will help you get started using Apache Spark Streaming for consuming and publishing messages with MapR Event Store and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. MapR Event Store is a distributed messaging system for streaming event data at scale. MapR Event Store enables producers and consumers to exchange events in real time via the Apache Kafka 0.9 API, and it integrates with Spark Streaming via the Kafka direct approach. This post is a simple "how to" example; if you are new to Spark Streaming or the Kafka API, you may want to review the introductory material on those topics first.

Example Use Case Data Set

The example data set consists of aggregated mobile network data generated by the Telecom Italia cellular network over the cities of Milan and Trento. The data measures the location and level of interaction of users with the mobile phone network, based on mobile events that occurred on the network over the course of two months in 2013. Projects in the Telecom Italia Big Data Challenge used this data to provide insights into, and to identify and predict, mobile phone-based location activity trends and patterns of a population in a large metropolitan area.

The Data Set Schema 

  1. Square id: id of the location in the city grid
  2. Time interval: the beginning of the time interval
  3. Country code: the phone country code
  4. SMS-in activity: SMS received inside the Square id
  5. SMS-out activity: SMS sent inside the Square id
  6. Call-in activity: received calls inside the Square id
  7. Call-out activity: issued calls inside the Square id
  8. Internet traffic activity: internet traffic inside the Square id

The Data Records are in TSV format.

Example Use Case Code

First, you import the packages needed to integrate MapR Streams (now called MapR Event Store) with Spark Streaming and Spark SQL.

  • For Spark Streaming to read messages from MapR Event Store, you need to import classes from org.apache.spark.streaming.kafka.v09.
  • For Spark Streaming to write messages to MapR Event Store, you need to import classes from org.apache.spark.streaming.kafka.producer.
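Putting that together, here is a minimal sketch of the imports, assuming the MapR-provided integration packages named above (the exact class names can vary by distribution version):

    // Spark core, streaming, and SQL
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Kafka 0.9 direct-stream consumer support for MapR Event Store
    import org.apache.spark.streaming.kafka.v09.KafkaUtils

    // producer support for writing DStreams back to a topic
    import org.apache.spark.streaming.kafka.producer._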

Parsing the Data Set Records

A Scala CallDataRecord case class defines the schema corresponding to the TSV records. The parseCallDataRecord function parses the tab separated values into the CallDataRecord case class.
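The listing below is a sketch of what that might look like, assuming the TSV layout described above; the field names are illustrative, and blank activity columns are treated as zero:

    // schema for one record of the Milan/Trento mobile network data
    case class CallDataRecord(squareId: Int, timeInterval: Long, countryCode: Int,
                              smsIn: Double, smsOut: Double, callIn: Double,
                              callOut: Double, internet: Double)

    // parse one tab-separated line into a CallDataRecord
    def parseCallDataRecord(line: String): CallDataRecord = {
      val f = line.split("\t", -1)              // keep trailing empty fields
      def toDouble(s: String): Double = if (s.trim.isEmpty) 0.0 else s.toDouble
      CallDataRecord(f(0).toInt, f(1).toLong, f(2).toInt,
        toDouble(f(3)), toDouble(f(4)), toDouble(f(5)),
        toDouble(f(6)), toDouble(f(7)))
    }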

Spark Streaming Code

These are the basic steps for the Spark Streaming Consumer Producer code:

  1. Configure Kafka Consumer Producer properties.
  2. Initialize a Spark StreamingContext object. Using this context, create a DStream which reads messages from a Topic.
  3. Apply transformations (which create new DStreams).
  4. Write messages from the transformed DStream to a Topic.
  5. Start receiving data and processing. Wait for the processing to be stopped.

We will go through each of these steps with the example application code.

1) Configure Kafka Consumer Producer properties

The first step is to set the KafkaConsumer and KafkaProducer configuration properties, which will be used later to create a DStream for receiving/sending messages to topics. You need to set the following parameters:

  • Key and value deserializers: for deserializing the message.
  • Auto offset reset: to start reading from the earliest or latest message.
  • Bootstrap servers: this can be set to a dummy host:port since the broker address is not actually used by MapR Event Store.
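A sketch of this configuration, assuming String keys and values; the topic paths are hypothetical, and the broker address is a placeholder because MapR Event Store ignores it:

    import org.apache.kafka.clients.consumer.ConsumerConfig

    val topicIn  = "/user/user01/stream:cdrs"        // hypothetical input topic path
    val topicOut = "/user/user01/stream:cdrsCounts"  // hypothetical output topic path

    val kafkaParams = Map[String, String](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "dummyhost:9092",  // not used by MapR Event Store
      ConsumerConfig.GROUP_ID_CONFIG          -> "sparkApplication",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG ->
        "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG ->
        "org.apache.kafka.common.serialization.StringDeserializer"
    )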

For more information on the configuration parameters, see the MapR Event Store documentation.

2) Initialize a Spark StreamingContext object.

We use the KafkaUtils createDirectStream method with a StreamingContext object, the Kafka configuration parameters, and a list of topics to create an input stream from a MapR Event Store topic. This creates a DStream that represents the stream of incoming data, where each message is a key value pair. We use the DStream map transformation to create a DStream with the message values.
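Sketched below under the same assumptions; depending on the package version, records arrive either as (key, value) tuples or as ConsumerRecord objects, so the value accessor may differ:

    val sparkConf = new SparkConf().setAppName("CallDataStream")
    val ssc = new StreamingContext(sparkConf, Seconds(2))   // 2-second batch interval (illustrative)

    // direct stream over the input topic; each element is a key-value pair
    val messagesDStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, Set(topicIn))

    // keep only the message values (the TSV records); use _.value() if your version returns ConsumerRecords
    val valuesDStream = messagesDStream.map(_._2)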

3) Apply transformations (which create new DStreams)

Next, we use the DStream map operation to parse the message values into CallDataRecord objects, then use the DStream foreachRDD method to apply processing to each RDD in this DStream: each RDD is converted to a DataFrame, which allows you to use DataFrame and SQL operations on the streaming data.
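One way this could look with Spark 2.x, using a SparkSession inside foreachRDD; the temporary view name and the example query are illustrative:

    // parse the TSV values into CallDataRecord objects
    val cdrDStream = valuesDStream.map(parseCallDataRecord)

    cdrDStream.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
        import spark.implicits._

        // convert this micro-batch to a DataFrame and register it for SQL queries
        val cdrDF = rdd.toDF()
        cdrDF.createOrReplaceTempView("cdrs")

        // example: total SMS received per grid square in this batch
        spark.sql("SELECT squareId, SUM(smsIn) AS totalSmsIn FROM cdrs GROUP BY squareId").show()
      }
    }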

4) Write messages from the transformed DStream to a Topic

The CallDataRecord RDD objects are grouped and counted by the squareId. Then the sendToKafka method is used to send messages with the squareId and count to a topic.
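A sketch of that step; the ProducerConf constructor and the sendToKafka extension method are assumptions based on the MapR producer package imported earlier, so check the linked repository and documentation for the exact signatures:

    // producer configuration; the broker address is again a placeholder
    val producerConf = new ProducerConf(bootstrapServers = List("dummyhost:9092"))

    // count records per squareId and format each result as "squareId count"
    val countsDStream = cdrDStream
      .map(cdr => (cdr.squareId, 1))
      .reduceByKey(_ + _)
      .map { case (squareId, count) => s"$squareId $count" }

    // publish the per-square counts to the output topic
    countsDStream.sendToKafka(topicOut, producerConf)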

5) Start receiving data and processing it. Wait for the processing to be stopped.

To start receiving data, we must explicitly call start() on the StreamingContext, then call awaitTermination to wait for the streaming computation to finish.
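In code, this final step is simply:

    ssc.start()
    ssc.awaitTermination()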

Software

You can download the code, data, and instructions to run this example from: https://github.com/caroljmcdonald/mapr-streams-spark

Summary

In this blog post, you learned how to integrate Spark Streaming with MapR Event Store to consume and produce messages using the Kafka API.


This blog was originally published on September 6, 2016.
