MapR: Developing Apache Spark Applications


Overview

This course enables developers to get started developing big data applications with Apache Spark. In the first part of the course, you will use Spark’s interactive shell to load and inspect data. The course then describes the various modes for launching a Spark application, and you will build and launch a standalone Spark application. The concepts are taught using scenarios that also form the basis of hands-on labs.

Duration

3 days

Who is the course for

Developers interested in designing and developing Spark applications.

Prerequisites

Attendees must have Java programming experience to do the exercises.

  • Basic to intermediate Linux knowledge, including the ability to use a text editor such as vi, and familiarity with basic commands such as mv, cp, ssh, grep, and cd
  • Knowledge of application development principles
  • A Linux, Windows, or macOS computer with the MapR Sandbox installed (for the on-demand course)
  • Connection to a Hadoop cluster via SSH and web browser (for the ILT and VILT course)

What you will learn

Included in this 3-day course are:

  • Access to a multi-node Amazon Web Services (AWS) cluster
  • Slide Guide PDF
  • Lab Guide PDF
  • Lab Code

Course Outline

Day 1

Lesson 1 – Introduction to Apache Spark

Describe the features of Apache Spark

  • Advantages of Spark
  • How Spark fits in with the big data application stack
  • How Spark fits in with Hadoop

Define Apache Spark components


Lesson 2 – Load and Inspect Data in Spark

  • Describe different ways of getting data into Spark
  • Create and use Resilient Distributed Datasets (RDDs)
  • Apply transformations to RDDs
  • Use actions on RDDs: Lab: Load and inspect data in an RDD
  • Cache intermediate RDDs
  • Use Spark DataFrames for simple queries: Lab: Load and inspect data in DataFrames

Lesson 3 – Build a Simple Spark Application

  • Define the lifecycle of a Spark program
  • Define the function of SparkContext: Lab: Create the application
  • Define different ways to run a Spark application
  • Run your Spark application: Lab: Launch the application


Day 2

Lesson 4 – Work with Pair RDDs

  • Describe pair RDDs
  • Explain why pair RDDs are used
  • Create pair RDDs
  • Apply transformations and actions to pair RDDs
  • Control partitioning across nodes
  • Change partitions
  • Determine the partitioner

Lesson 5 – Work with Spark DataFrames

  • Create Apache Spark DataFrames
  • Work with data in DataFrames
  • Create user-defined functions
  • Repartition DataFrames

Lesson 6 – Monitor a Spark Application

  • Describe the components of the Spark execution model
  • Use the SparkUI to monitor a Spark application
  • Debug and tune Spark applications


Day 3

Lesson 7 – Introduction to Apache Spark Data Pipelines

  • Identify components of Apache Spark Unified Stack
  • Describe the benefits of the Apache Spark Unified Stack over the Hadoop ecosystem
  • Describe data pipeline use cases

Lesson 8 – Create an Apache Spark Streaming Application

  • Describe the Spark Streaming architecture
  • Create DStreams
  • Create a simple Spark Streaming application: Lab: Create a Spark Streaming application
  • Apply DStream operations: Lab: Apply operations on DStreams
  • Use Spark SQL to query DStreams
  • Define window operations: Lab: Add windowing operations
  • Describe how DStreams are fault-tolerant

Lesson 9 – Use Apache Spark GraphX to Analyse Flight Data

  • Describe GraphX
  • Define a property graph: Lab: Create a property graph
  • Perform operations on graphs: Lab: Apply graph operations

Lesson 10 – Use Apache Spark MLlib to Predict Flight Delays

  • Describe Spark MLlib
  • Describe a generic classification workflow
  • Describe common terms for supervised learning
  • Use a decision tree for classification and regression
  • Lab: Create a DecisionTree model to predict flight delays on streaming data.

Related Training Courses


MapR: Hive and Pig
This 2-day course covers how Hive emulates SQL in a Hadoop cluster, introduces dataflow languages, and shows how to create efficient data flows using Pig.

MapR: Developing Hadoop Applications
This 3-day course provides instruction on how to write Hadoop applications using MapReduce and YARN in Java.

MapR: HBase Applications Design and Build
This 3-day course introduces the concepts of NoSQL technologies, HBase architecture, schema design, performance tuning, bulk loading of data, and the storing of complex data structures.

