This Big Data Analytics course gives attendees the essential skills to develop analytical applications using Spark and SparkSQL. Our goal is to make students productive as quickly as possible, while setting the stage for their future growth as Hadoop developers.
This course provides the essential grounding in the principles of Hadoop, the MapReduce computation model and the Hadoop Distributed File System (HDFS) storage engine, the role of Spark, SparkSQL, and SparkML, and how to write applications using these tools. If your developers are new to Hadoop, they’ll learn the skills necessary to start using Hadoop and integrating it with your existing capabilities.
During this course, each student will have access to their own Amazon Web Services Elastic MapReduce (AWS EMR) cluster to gain hands-on experience with Hadoop. We will also provide students with configuration information necessary for them to create and use identically configured AWS EMR clusters after they have completed the course.
Who is the course for
Data Scientists and Analysts.
The following prerequisites ensure that students gain the maximum benefit from the course.
- Programming experience: this is an analytics developer’s course. We will write Java, Hive, Spark, and Pig applications. Prior programming experience in a modern language is essential; prior experience in Java, Scala, Python, or R is helpful for students wishing to build full-scale applications but is not required.
- Linux shell and editor experience (recommended): Basic Linux shell (bash) commands will be used extensively, and students will need to do some text editing using vi or emacs during the course.
- Experience with SQL databases: students will find SQL experience useful for learning Hive, Pig, and SparkSQL but not essential.
- Laptops: students should bring either a Mac or Windows laptop to the course with the Safari or Chrome web browser installed. Windows users should also download and install the Git Bash program to allow ssh access to the cluster.
What you will learn
Think Big Academy courses teach by doing: short lectures are interspersed with hands-on exercises. By the end of the course, students will learn the following:
- What Hadoop is and the problems it solves.
- How to use the Hadoop command line tools and the web consoles.
- How to create Hadoop applications using the MapReduce Java and streaming APIs.
- How the Hive Metastore works and why it’s key to Spark.
- Where cluster bottlenecks typically arise and how to avoid them.
- How to use Spark and SparkSQL to perform in-memory analyses.
- How to use SparkML and Spark MLlib to automate machine learning over big data.
- Common pitfalls and problems, and how to avoid them.
- Lessons from real-world Hadoop implementations.
Day 1 – Big Data Fundamentals
Introduction to big data
An introduction to the history and technology of Hadoop for a general audience, including:
- History of Hadoop
- Four key differences in big data versus prior computing models
- Concepts of MapReduce
- Wordcount concepts and exercise
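To give a feel for the wordcount exercise, here is a minimal sketch of the map, shuffle, and reduce phases in plain Python. This is a local illustration of the computation model only, not Hadoop code; the function names are ours.

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

On a real cluster each phase runs in parallel across many machines, but the per-key logic is exactly this simple.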
Cluster Architecture and MapReduce
How Java developers can apply the basic MapReduce Java APIs to big data tasks, including:
- Understanding how clusters are designed
- Building and structuring MapReduce jobs in Java
- Mapper and Reducer APIs and classes
- Differences between MapReduce 1 and 2
- The streaming API for scripting
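The streaming API lets any language act as mapper and reducer by reading lines on stdin and writing tab-separated key/value pairs to stdout. A sketch of that contract in Python (function names are ours; in practice each function would be its own script wired to stdin/stdout):

```python
def streaming_mapper(lines):
    # A Hadoop Streaming mapper reads raw input lines and emits
    # "key<TAB>value" lines, here one "word\t1" per word.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def streaming_reducer(lines):
    # The reducer receives mapper output sorted by key, so consecutive
    # lines with the same key form one group; sum each group's counts.
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

mapped = sorted(streaming_mapper(["to be or not to be"]))
reduced = list(streaming_reducer(mapped))
```

The `sorted()` call stands in for the framework's shuffle/sort step between the two scripts.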
Understanding SQL on Hadoop
An introduction to using Hive and HQL to manipulate big data including:
- Schema on Read concepts and how Hive differs from databases
- The Metastore and defining databases, tables, and schemas
- Loading data into Hadoop and the Hive Metastore, including internal and external tables
- Querying tables using HQL including differences from SQL
- Basic joins, partitioning, and optimization of big data
Day 2 – Spark and SparkSQL
Big Data Bottlenecks
An introduction to the fundamental limitations of traditional Hadoop cluster performance, including:
- Which cluster components pose the biggest constraints on application performance
- Metrics every Hadoop programmer must know
- Best practices for optimizing cluster application results
Spark Architecture and Concepts
An introduction to Spark including:
- Spark architecture and how it avoids MapReduce bottlenecks
- How Spark fits into the Hadoop ecosystem
- Comparing Spark wordcount and other examples in Scala, Python, Java, and R
Spark in Scala
How to develop Spark programs in Scala, including:
- Introduction to Scala data types and functions
- Immutable and mutable values and why they matter
- Using RDDs, transformations and actions for big data tasks
- How lazy evaluation affects program semantics
- Spark deployment and caching options
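Lazy evaluation is the point in this list that most often surprises newcomers. As a rough analogy (plain Python, not Spark), generators behave like RDD transformations: building the chain does no work, and only an "action" forces evaluation.

```python
def transformations(numbers):
    # Like an RDD transformation chain, nothing here executes yet:
    # generators only describe the computation.
    doubled = (n * 2 for n in numbers)           # analogous to rdd.map(...)
    evens = (n for n in doubled if n % 4 == 0)   # analogous to .filter(...)
    return evens

pipeline = transformations(range(10))  # lazy: no work has been done
result = sum(pipeline)                 # an "action" triggers evaluation
```

The analogy also motivates caching: like an exhausted generator, an uncached RDD is recomputed for every action, which is why Spark provides `cache()` and `persist()`.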
Using SparkSQL and Dataframes
An introduction to the use of SparkSQL and dataframes, including:
- Spark sessions and SparkSQL contexts
- Dataframes and how they relate to RDDs and datasets
- SparkSQL’s use of the Hive Metastore
- File formats available for use with SparkSQL
- Interleaving of Spark and SparkSQL actions with Scala programs
Spark Machine Learning Techniques
An introduction to the use of the Spark MLlib package including:
- ML and MLlib data types: Vectors, LabeledPoints, and Ratings
- Using sampled training data sets and the predict function
- Use of feature extraction, classification, regression, and clustering algorithms
- Construction of Machine Learning data pipelines
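The train/predict/evaluate workflow that MLlib automates can be sketched locally in plain Python. This is a conceptual illustration with a toy nearest-centroid classifier, not the MLlib API; the data and function names are ours.

```python
import random

# Toy labeled data: (feature, label) pairs, akin to MLlib LabeledPoints.
data = [(x, 0) for x in range(0, 50)] + [(x, 1) for x in range(50, 100)]

# Sample a training set and hold out the rest for evaluation,
# as randomSplit does on an RDD or DataFrame.
rng = random.Random(42)
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# "Fit" a trivial model: the mean feature value per class.
centroids = {}
for label in (0, 1):
    values = [x for x, y in train if y == label]
    centroids[label] = sum(values) / len(values)

def predict(x):
    # Assign the class whose centroid is nearest, like model.predict(x).
    return min(centroids, key=lambda label: abs(x - centroids[label]))

accuracy = sum(predict(x) == y for x, y in test) / len(test)
```

MLlib follows the same shape at cluster scale: split, fit on the training partition, then score the held-out partition.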
Day 3 – Advanced Spark: Streaming, PySpark and SparkR
Spark Streaming
An introduction to the use of Spark Streaming, including:
- The differences between traditional Spark Streaming (DStreams) and Structured Streaming
- Discretized Streams (DStreams) and microbatching
- The structure of streaming programs
- Input and output considerations for streaming
- How to integrate streaming applications into a larger big data workflow
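Microbatching, the core DStream idea, can be sketched in a few lines of plain Python: chop a continuous stream into small batches and run an ordinary batch computation on each one. This is an illustration of the concept, not Spark Streaming code.

```python
from collections import Counter
from itertools import islice

def microbatches(events, batch_size):
    # DStream model: slice an unbounded stream into small batches,
    # each of which is processed like a normal batch job.
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

stream = ["click", "view", "click", "view", "view", "click"]
per_batch_counts = [Counter(batch) for batch in microbatches(stream, 3)]
```

In Spark Streaming the batch boundary is a time interval rather than a count, but each microbatch is likewise handed to the ordinary Spark engine.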
Using Python on Spark
How to develop Spark programs in Python including:
- How Python programming differs from the Scala API
- How to develop interactive and standalone PySpark applications
- When programmers should apply SparkML libraries instead of scikit-learn
- Understanding serialization bottlenecks in Python
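The serialization bottleneck exists because PySpark must pickle Python functions and data to move them between the JVM and Python worker processes. A minimal local illustration of that round-trip cost (standard `pickle`; PySpark itself uses a cloudpickle variant):

```python
import pickle
import time

# Every record that crosses the JVM/Python boundary pays a
# serialize/deserialize cost like this one.
payload = list(range(100_000))

start = time.perf_counter()
blob = pickle.dumps(payload)        # serialize, as when shipping to a worker
restored = pickle.loads(blob)       # deserialize on the other side
elapsed = time.perf_counter() - start
```

Keeping work inside SparkSQL/DataFrame operations, which execute in the JVM, avoids much of this overhead compared with row-by-row Python lambdas.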
Using R on Spark
How to develop Spark programs in R including:
- How the R API differs from the Java, Scala, and Python Spark APIs
- Developing SparkR programs using the standard Spark Dataframe API
- Higher level programming in R using sparklyr
- Interactions and integration with dplyr and ggplot2 packages
- Understanding serialization bottlenecks in R and Python
50% Hands-on Labs
Related Training Courses
Big Data Concepts and Hadoop Essentials: This 1-day course is for everyone interested in adopting big data or involved in data-driven business change, and anyone who needs an overview of the Hadoop technology ecosystem.
Introduction to Apache NiFi / Kylo: This course provides an introduction to the concept of data lakes and how Apache NiFi may be used to develop and administer your data lake.
Business Analytics Solutions in the Hadoop Ecosystem: This course provides the essential grounding in the principles of Hadoop, the MapReduce computation model, and the Hadoop Distributed File System (HDFS) storage engine.
Introduction to Data Science in the Big Data World: This course provides an introduction to the methods and tools of data science in a big data context.