Building Big Data Analytics with Spark



This Big Data Analytics course gives attendees the essential skills to develop analytical applications using Spark and SparkSQL. Our goal is to make students productive as quickly as possible while setting the stage for their future growth as Hadoop developers.

This course provides the essential grounding in the principles of Hadoop, the MapReduce computation model and the Hadoop Distributed File System (HDFS) storage engine, the role of Spark, SparkSQL, and SparkML, and how to write applications using these tools. If your developers are new to Hadoop, they’ll learn the skills necessary to start using Hadoop and integrating it with your existing capabilities.

During this course, each student will have access to their own Amazon Web Services Elastic MapReduce (AWS EMR) cluster to gain hands-on experience with Hadoop. We will also provide students with configuration information necessary for them to create and use identically configured AWS EMR clusters after they have completed the course.


3 days

Who is the course for

Data Scientists and Analysts.


The following prerequisites ensure that students gain the maximum benefit from the course.

  • Programming experience: this is an analytics developer’s course. We will write Java, Hive, Spark, and Pig applications. Prior programming experience in some modern language is essential; prior experience in Java, Scala, Python or R is helpful for students wishing to build full-scale applications but is not required.
  • Linux shell and editor experience (recommended): Basic Linux shell (bash) commands will be used extensively, and students will need to do some text editing using vi or emacs during the course.
  • Experience with SQL databases: SQL experience is useful for learning Hive, Pig, and SparkSQL, but is not essential.
  • Laptops: students should bring either a Mac or Windows laptop to the course with the Safari or Chrome web browser installed. Windows users should also download and install Git Bash to allow ssh access to the cluster.

What you will learn

Think Big Academy courses teach by doing: short lectures are interspersed with hands-on exercises. By the end of the course, students will have learned the following:

  • Introduction to Hadoop and the problems it solves.
  • How to use the Hadoop command line tools and the web consoles.
  • How to create Hadoop applications using the MapReduce Java and streaming APIs.
  • How the Hive Metastore works and why it’s key to Spark.
  • Where cluster bottlenecks typically arise and how to avoid them.
  • How to use Spark and SparkSQL to perform in-memory analyses.
  • How to use SparkML and Spark MLlib to automate machine learning over big data.
  • Common pitfalls and problems, and how to avoid them.
  • Lessons from real-world Hadoop implementations.

Course Outline

Day 1 – Big Data Fundamentals

Introduction to big data

An introduction to the history and technology of Hadoop for a general audience, including:

  • History of Hadoop
  • Four key differences in big data versus prior computing models
  • Concepts of MapReduce
  • Wordcount concepts and exercise
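
As a rough illustration of the wordcount concept, the sketch below walks through the map, shuffle, and reduce phases in plain Python on a single machine. It is not the course exercise and does not use the Hadoop APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for this example.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key (the word).
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    # Reduce: combine the values for one key into a single total.
    return word, sum(counts)

def word_count(lines):
    # Run the three phases in sequence over all input lines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())

# Hypothetical sample input, standing in for lines of a large file.
counts = word_count(["big data big clusters", "data moves to compute"])
```

On a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data across the network; here everything runs in one process so that the data flow is easy to follow.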

Cluster Architecture and MapReduce

How Java developers can apply MapReduce basic Java APIs for big data tasks, including:

  • Understanding how clusters are designed
  • Building and structuring MapReduce jobs in Java
  • Mapper and Reducer APIs and classes
  • Differences between MapReduce 1 and 2
  • The streaming API for scripting

Understanding SQL on Hadoop 

An introduction to using Hive and HQL to manipulate big data including:

  • Schema on Read concepts and how Hive differs from databases
  • The Metastore and defining databases, tables, and schemas
  • Loading data into Hadoop and the Hive Metastore, including internal and external tables
  • Querying tables using HQL including differences from SQL
  • Basic joins, partitioning, and optimization of big data

Day 2 – Spark and SparkSQL

Big Data Bottlenecks

An introduction to the fundamental limitations of traditional Hadoop cluster performance, including:

  • Which cluster components pose the biggest constraints on application performance
  • Metrics every Hadoop programmer must know
  • Best practices for optimizing cluster application results

Spark Architecture and Concepts

An introduction to Spark including:

  • Spark architecture and how it avoids MapReduce bottlenecks
  • How Spark fits into the Hadoop ecosystem
  • Comparing Spark wordcount and other examples in Scala, Python, Java, and R

Spark in Scala

How to develop Spark programs in Scala, including:

  • Introduction to Scala data types and functions
  • Immutables and mutables and why they matter
  • Using RDDs, transformations and actions for big data tasks
  • How lazy evaluation affects program semantics
  • Spark deployment and caching options
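
The effect of lazy evaluation on program semantics can be sketched by analogy with Python generators. This is only an analogy and does not use the Spark API: the generator expressions stand in for transformations, `sum` stands in for an action, and `read_records` and `log` are hypothetical names introduced for this example.

```python
log = []

def read_records():
    # Hypothetical data source: log each read so we can see when work happens.
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n

# "Transformations": building the pipeline performs no reads yet.
squared = (n * n for n in read_records())
evens = (n for n in squared if n % 2 == 0)
work_before_action = len(log)

# "Action": consuming the result triggers every upstream step.
total = sum(evens)
work_after_action = len(log)
```

No records are read while the pipeline is being defined; all five reads happen only when the final result is requested. Spark transformations and actions follow the same pattern, which is why side effects inside a transformation may run later than, or more often than, a reader of the code might expect.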

Using SparkSQL and Dataframes 

An introduction to the use of SparkSQL and DataFrames, including:

  • Spark sessions and SparkSQL contexts
  • Dataframes and how they relate to RDDs and datasets
  • SparkSQL’s use of the Hive Metastore
  • File formats available for use with SparkSQL
  • Interleaving of Spark and SparkSQL actions with Scala programs

Spark Machine Learning Techniques 

An introduction to the use of the Spark MLlib package including:

  • ML and MLlib data types: Vectors, LabeledPoints, and Ratings
  • Using sampled training data sets and the predict function
  • Use of feature extraction, classification, regression, and clustering algorithms
  • Construction of Machine Learning data pipelines

Day 3 – Advanced Spark: Streaming, PySpark and SparkR

Spark Streaming

An introduction to the use of Spark Streaming, including:

  • The differences between traditional Spark Streaming and Structured Streaming
  • Discretized Streams (DStreams) and microbatching
  • The structure of streaming programs
  • Input and output considerations for streaming
  • How to integrate streaming applications into a larger big data workflow
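
Microbatching treats a live stream as a sequence of small batches, each covering a fixed time interval. The sketch below shows the idea in plain Python without the Spark Streaming API; the `micro_batches` function, the one-second interval, and the sample events are assumptions made for this example.

```python
from collections import defaultdict

def micro_batches(events, interval):
    # Assign each (timestamp, value) event to the batch covering its timestamp.
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    # Return the non-empty batches in time order.
    return [batches[key] for key in sorted(batches)]

# Hypothetical events as (seconds since start, payload) pairs.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, interval=1.0)
batch_sizes = [len(batch) for batch in batches]
```

Each batch can then be processed with ordinary batch logic. This simplified version skips intervals in which no events arrive and assumes all events are available up front, whereas a real streaming engine processes each interval as it closes.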

Using Python on Spark

How to develop Spark programs in Python including:

  • How Python programming differs from the Scala API
  • How to develop interactive and standalone PySpark applications
  • When programmers should apply SparkML libraries instead of Scikit-Learn
  • Understanding serialization bottlenecks in Python

Using R on Spark

How to develop Spark programs in R including:

  • How R programming differs from the Java, Scala, and Python Spark APIs
  • Developing SparkR programs using the standard Spark Dataframe API
  • Higher level programming in R using sparklyr
  • Interactions and integration with dplyr and ggplot2 packages
  • Understanding serialization bottlenecks in R and Python


50% Lecture/Discussion

50% Hands-on Labs

Related Training Courses

Big Data Concepts and Hadoop Essentials This 1-day course is for everyone interested in adopting big data or involved in data-driven business change and anyone who needs an overview of the Hadoop technology ecosystem.

Introduction to Apache NiFi / Kylo This course provides an introduction to the concept of data lakes, and how Apache NiFi may be used to develop and administer your data lake.

Business Analytics Solutions in the Hadoop Ecosystem This course provides the essential grounding in the principles of Hadoop, the MapReduce computation model and the Hadoop Distributed File System (HDFS) storage engine.

Introduction to Data Science in the Big Data World This course provides an introduction to the methods and tools of data science in a big data context.
