Building Big Data Analytics Solutions in the Hadoop Ecosystem


Hadoop is more than just a couple of technologies such as Hive and Spark; it’s an entire ecosystem of tools that can generate business value. This 3-day Big Data Analytics course gives attendees the essential skills to develop analytical applications using Big Data tools such as Apache Hive, Apache Spark, Presto, Apache NiFi, and Kylo. Our goal is to make students productive in modern Hadoop concepts and tools, while setting the stage for their future growth as Hadoop developers.

This course provides the essential grounding in the principles of Hadoop, the Hadoop Distributed File System (HDFS) storage engine, the roles of new Hadoop computation models such as Spark, and how to write applications effectively using these tools. If your developers are new to Hadoop, they’ll learn the skills necessary to start using Hadoop and integrating it with your existing capabilities.

During this 3-day course, each student will have access to their own Amazon Web Services Elastic MapReduce (AWS EMR) cluster to gain hands-on experience with Hadoop. We will also provide students with configuration information necessary for them to create and use identically configured AWS EMR clusters after they have completed the course.


3 days

Who is the course for

Analytics Developers and Data Engineers wishing to learn how to apply Big Data tools in their analyses.


The following prerequisites ensure that students gain the maximum benefit from the course:

  • Programming experience: This is an analytics developer’s course. We will write Java, Hive, Spark, and Pig applications. Prior programming experience in some modern language is essential; prior experience in Java, Scala, Python, or R is helpful for students wishing to build full-scale applications but is not required.
  • Linux shell and editor experience (recommended): Basic Linux shell (bash) commands will be used extensively, and students will need to do some text editing using vi or emacs during the course.
  • Experience with SQL databases: Students will find SQL experience useful, but not essential, for learning Hive, Pig, and SparkSQL.
  • Laptops: Students should bring either a Mac or Windows laptop to the course with the Safari or Chrome web browser installed. Windows users should also download and install Git Bash to allow ssh access to the cluster.

What you will learn

Think Big Academy courses teach by doing: short lectures are interspersed with hands-on exercises. By the end of the course, students will have covered the following:

  • Introduction to Hadoop and the problems it solves
  • How to use the Hadoop command line tools and the web consoles
  • How to create Hadoop applications using the MapReduce Java and Streaming APIs
  • How to use Hive to analyze unstructured and structured data at a large scale in Hadoop
  • Where cluster bottlenecks typically arise and how to avoid them
  • How to write Spark code to do in-memory analyses
  • How to use SparkSQL to integrate with the Hive metastore
  • Techniques for analyzing small to large data sets with Hive and Spark
  • How to choose the right SQL on Hadoop tool for your application
  • How to automate data ingestion using Apache NiFi and Kylo
  • How to federate queries across databases using Presto
  • Common pitfalls and problems, and how to avoid them
  • Lessons from real-world Hadoop implementations

Course Outline

Day 1 – Big Data Fundamentals

Introduction to big data
An introduction to the history and technology of Hadoop for a general audience, including:

  • History of Hadoop
  • Four key differences in big data versus prior computing models
  • Concepts of MapReduce
  • Wordcount concepts and exercise
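
As a rough illustration of the MapReduce and wordcount concepts listed above, the following is a minimal plain-Python sketch, not the course's actual exercise and not the Hadoop API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example; on a real cluster the framework performs the shuffle step and distributes the work across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for a single word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big tools"]))
```

The point of the split is that map_phase and reduce_phase each look at only one line or one key at a time, which is what allows a cluster to run many of them in parallel.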

Cluster Architecture and MapReduce
How Java developers can apply MapReduce basic Java APIs for big data tasks, including:

  • Understanding how clusters are designed
  • Building and structuring MapReduce jobs in Java
  • Mapper and Reducer APIs and classes
  • Differences between MapReduce 1 and 2
  • The streaming API for scripting

Using SQL on Hadoop with Hive
An introduction to using Hive and HQL to manipulate big data, including:

  • Schema on Read concepts and how Hive differs from databases
  • The Metastore and defining databases, tables and schemas
  • Loading data into Hive using HQL, including differences from SQL
  • Basic joins, partitioning, and optimization of Hive data

Big Data Bottlenecks

An introduction to the fundamental limitations of traditional Hadoop cluster performance, including:

  • Which cluster components pose the biggest constraints on application performance
  • Metrics every Hadoop programmer must know
  • Best practices for optimizing cluster application results

Spark Architecture and Concepts

An introduction to Spark, including:

  • Spark architecture and how it avoids MapReduce bottlenecks
  • How Spark fits into the Hadoop ecosystem
  • Comparing Spark wordcount and other examples in Scala, Python, Java and R

Day 2 – Hive and Spark in detail

Hive in Depth
Advanced topics for Hive developers, including:

  • Advanced joins and how to optimize them
  • SerDes for inputting nonstandard binary data
  • User-defined functions (UDFs) and custom UDFs
  • Complex queries
  • Advanced file formats such as Parquet, ORC and Avro

Spark in Scala
How to develop Spark programs in Scala, including:

  • Introduction to Scala data types and functions
  • Immutables and mutables and why they matter
  • Using RDDs, Transformations and Actions for big data tasks
  • How lazy evaluation affects program semantics
  • Spark deployment and caching options
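
To give a feel for the transformation/action split and lazy evaluation mentioned in the list above, here is a small plain-Python analogy. It is not Spark code: the LazyDataset class and its map, filter, and collect methods are hypothetical stand-ins that only mimic the idea that transformations are recorded and nothing runs until an action asks for a result.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (an assumption, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step and return a new dataset; no work is done yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: apply the recorded steps in order and materialize the result.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


if __name__ == "__main__":
    seen = []

    def traced_double(x):
        seen.append(x)  # side effect shows when the function really runs
        return x * 2

    dataset = LazyDataset([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
    print(seen)               # still empty: nothing has executed
    print(dataset.collect())  # the action triggers the whole pipeline
    print(seen)
```

One consequence for program semantics is that side effects inside a transformation happen when the action runs, not where the transformation is written.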

Using SparkSQL and Dataframes
An introduction to the use of SparkSQL and DataFrames, including:

  • Spark sessions and SparkSQL contexts
  • Dataframes and how they relate to RDDs and datasets
  • SparkSQL’s use of the Hive Metastore
  • File formats available for use with SparkSQL
  • Interleaving of Spark and SparkSQL actions with Scala programs

Spark Streaming Using Datasets

An introduction to the use of Spark 2, datasets and Spark Streaming, including:

  • What constitutes a dataset
  • How Spark can stream microbatches from datasets
  • Dataset selection, projection and aggregation operations
  • Window operations on event time
  • Handling late data

Day 3 – Specialized Hadoop Tools

Data Ingestion Using Apache NiFi and Kylo
An introduction to Data Lake architecture and technologies including:

  • What a data lake is and how data lakes are used
  • The roles of data governance, provenance and metadata in an effective data lake
  • Overview of data ingestion technologies
  • Apache NiFi overview, capabilities and hands-on use
  • Kylo overview, capabilities and hands-on use
  • Summary of other open source frameworks

Using Presto
An introduction to federated SQL operations using Presto on Hadoop clusters, including:

  • Presto architecture and metadata
  • How to use Catalogs and Schemas
  • Performing Joins across data sets on different systems (federated queries)

Developing NoSQL Databases Using HBase
An introduction to the concepts of NoSQL databases and HBase, including:

  • Differences between NoSQL and relational databases
  • How HBase works with Hadoop infrastructure
  • Physical representations of HBase databases on disk
  • Files, Regions, Splits and Compactions


50% Lecture/Discussion
50% Hands-on Labs

Additional Information

The course content can be customised to cover any specialised material you may require for your specific training needs.

This course can be offered as private on-site training hosted at your offices. For more information, please contact us at [email protected]

Related Training Courses

Apache Cassandra
This is a fast-paced, vendor-agnostic technical Apache Cassandra course that focuses on the key aspects of the technology for developers and system operations staff, covering core internal and distributed architecture fundamentals.

HDP Analyst: Apache HBase Essentials
This 2-day workshop introduces HBase basics, structure and operations in an intensely hands-on experience.

Apache Hadoop Essentials
This course is designed to help attendees understand the concepts and benefits of Apache Hadoop and how it can help them meet their business goals.

Machine Learning with Apache Hadoop
This course is designed to help attendees understand the high-level concepts and classifications of machine learning systems with a strong focus on building Recommender Systems.
