Hadoop Developer

Overview

The Hadoop Developer course gives attendees the essential skills to develop applications with Apache Hadoop and Hadoop ecosystem tools including Hive, Pig, Spark and NiFi. Our goal is to make students productive as quickly as possible, while setting the stage for their future growth as a Hadoop developer.

This course provides the essential grounding in the principles of Hadoop, the MapReduce computation model and the Hadoop Distributed File System (HDFS) storage engine, the roles of other essential tools, and how to write applications effectively using these tools. If your developers are new to Hadoop, they’ll learn the skills necessary to start using Hadoop and integrating it with your existing capabilities.

During this course, each student will have access to their own Amazon Web Services Elastic MapReduce (AWS EMR) cluster to gain hands-on experience with Hadoop. We will also provide students with the configuration information necessary for them to create and use identically configured AWS EMR clusters after they have completed the course.

Duration

3 days

Who is the course for

Software Engineers, Data Scientists and Analysts

Prerequisites

The following prerequisites ensure that students gain the maximum benefit from the course:

  • Programming experience – this is a developer’s course; we will write Java, Hive, Spark and Pig applications. Prior programming experience in a modern language is essential; prior experience with Java, Scala, Python, or R is helpful for students wishing to build full-scale applications but is not required
  • Linux shell and editor experience (recommended) – basic Linux shell (bash) commands will be used extensively, and students will need to do some text editing using vi or emacs during the course
  • Experience with SQL databases – students will find SQL experience useful, though not essential, for learning Hive, Pig, and SparkSQL
  • Laptops – students should bring either a Mac or Windows laptop to the course with the Safari or Chrome web browser installed. Windows users should also download and install Git Bash to allow SSH access to the cluster

What you will learn

Think Big Academy courses teach by doing: short lectures are interspersed with hands-on exercises. By the end of the course, students will have learned the following:

  • What Hadoop is and the problems it solves
  • How to use the Hadoop command line tools and the web consoles
  • How to create Hadoop applications using the MapReduce Java and Streaming APIs
  • How to use Hive to analyze unstructured and structured data at a large scale in Hadoop
  • Where cluster bottlenecks typically arise and how to avoid them
  • How to use Spark and SparkSQL to perform in-memory analyses
  • Analyzing small to large data sets with Hive and Spark
  • How to choose the right SQL on Hadoop tool for your application
  • How to automate data ingestion using Apache NiFi and Kylo
  • Common pitfalls and problems, and how to avoid them
  • Lessons from real-world Hadoop implementations

Course Outline

Day 1 – Big Data Fundamentals

Introduction to big data
An introduction to the history and technology of Hadoop for a general audience, including:

  • History of Hadoop
  • Four key differences between big data and prior computing models
  • Concepts of MapReduce
  • Wordcount concepts and exercise
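
To give a flavour of the wordcount exercise above, here is a minimal sketch of the map, shuffle and reduce phases using plain Scala collections on a single machine. It is for orientation only and is not the course lab code, which applies the same idea across a cluster.

```scala
// Word count, phase by phase, on plain Scala collections.
// A real MapReduce job performs the same steps across many machines.
object WordCountConcept {
  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog", "the fox")

    // Map phase: turn each line into (word, 1) pairs.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle phase: group the pairs by key (the word).
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce phase: sum the counts collected for each word.
    val reduced: Map[String, Int] =
      shuffled.map { case (word, counts) => (word, counts.sum) }

    reduced.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```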

Cluster Architecture and MapReduce
How Java developers can apply the basic MapReduce Java APIs to big data tasks, including:

  • Understanding how clusters are designed
  • Building and structuring MapReduce jobs in Java
  • Mapper and Reducer APIs and classes
  • Differences between MapReduce 1 and 2
  • The streaming API for scripting
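
The course labs write MapReduce jobs in Java; purely as an illustration (and to stay consistent with the Scala sketches elsewhere in this outline), the same Mapper/Reducer APIs can be driven from Scala roughly as follows. This assumes the Hadoop client libraries are on the classpath and that input and output paths are passed as command-line arguments.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (word, 1) for every token in the input split.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }
}

// Reducer: sum the counts that the shuffle grouped under each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get())
    context.write(key, new IntWritable(sum))
  }
}

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer])   // combiner cuts shuffle volume
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```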

Using SQL on Hadoop with Hive
An introduction to using Hive and HQL to manipulate big data, including:

  • Schema on Read concepts and how Hive differs from databases
  • The Metastore and defining databases, tables and schemas
  • Loading data into Hive tables using HQL, including differences from SQL
  • Basic joins, partitioning, and optimization of Hive data
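
In the labs these statements are typed into the Hive shell; the sketch below shows the same schema-on-read pattern issued from Scala through a Hive-enabled SparkSession, which shares Hive's Metastore. The database, table, columns and path are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object HiveSchemaOnRead {
  def main(args: Array[String]): Unit = {
    // A SparkSession with Hive support talks to the same Metastore that Hive uses.
    val spark = SparkSession.builder()
      .appName("hive-schema-on-read")
      .enableHiveSupport()
      .getOrCreate()

    // Schema on read: the table definition is just metadata over files already
    // sitting in HDFS/S3; nothing is converted or validated at load time.
    spark.sql("CREATE DATABASE IF NOT EXISTS demo")
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS demo.web_logs (
        |  ip      STRING,
        |  ts      STRING,
        |  url     STRING,
        |  status  INT
        |)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        |LOCATION '/data/raw/web_logs'""".stripMargin)

    // A basic aggregation; joins and partition pruning build on the same pattern.
    spark.sql(
      "SELECT status, COUNT(*) AS hits FROM demo.web_logs GROUP BY status ORDER BY hits DESC"
    ).show()

    spark.stop()
  }
}
```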

Day 2 – Hive and Spark

Advanced Hive
Advanced topics for Hive developers, including:

  • Advanced joins and how to optimize them
  • SerDes for reading non-standard and binary data
  • User-defined functions (UDFs) and custom UDFs
  • Complex queries
  • Advanced file formats such as Parquet, ORC and Avro
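
As a taste of the custom-UDF topic above, one common approach is to extend Hive's (legacy) UDF class, package the result as a jar and register it with CREATE TEMPORARY FUNCTION. The sketch below is illustrative only; the function name and logic are made up.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// A minimal custom Hive UDF. After packaging into a jar, register it from Hive with:
//   ADD JAR my-udfs.jar;
//   CREATE TEMPORARY FUNCTION normalize_url AS 'NormalizeUrl';
class NormalizeUrl extends UDF {
  // Hive resolves evaluate() by reflection; a null input must return null.
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.trim.toLowerCase.stripSuffix("/"))
  }
}
```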

Big Data Bottlenecks
An introduction to the fundamental limitations of traditional Hadoop cluster performance, including:

  • Which cluster components pose the biggest constraints on application performance
  • Metrics every Hadoop programmer must know
  • Best practices for optimizing cluster application results

Spark Architecture and Concepts
An introduction to Spark including:

  • Spark architecture and how it avoids MapReduce bottlenecks
  • How Spark fits into the Hadoop ecosystem
  • Comparing Spark wordcount and other examples in Scala, Python, Java and R
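
For comparison with the MapReduce version sketched on Day 1, the Scala flavour of the Spark wordcount might look like this. It is a sketch only; the input and output paths are placeholders passed on the command line.

```scala
import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-word-count").getOrCreate()
    val sc = spark.sparkContext

    // The whole pipeline is one chain of transformations; there are no MapReduce
    // job boundaries or intermediate writes to HDFS between the steps.
    val counts = sc.textFile(args(0))          // e.g. an HDFS or S3 path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))
    spark.stop()
  }
}
```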

Spark in Scala
How to develop Spark programs in Scala, including:

  • Introduction to Scala data types and functions
  • Immutable and mutable values and why the distinction matters
  • Using RDDs, Transformations and Actions for big data tasks
  • How lazy evaluation affects program semantics
  • Spark deployment and caching options
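
The interplay of lazy evaluation, actions and caching listed above can be seen in a small sketch like the one below; the log path, field layout and storage level are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object LazyAndCached {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-and-cached").getOrCreate()
    val sc = spark.sparkContext

    // Transformations only build a lineage graph; nothing runs yet.
    val errors = sc.textFile("/data/raw/app_logs")   // illustrative path
      .filter(_.contains("ERROR"))
      .map(_.split("\t", -1))

    // Mark the RDD for reuse so the second action does not re-read the input.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    // Actions trigger execution: the first materialises (and caches) errors,
    // the second reuses the cached partitions.
    val total = errors.count()
    val byCode = errors.map(f => (f(3), 1))          // field 3 assumed to hold an error code
      .reduceByKey(_ + _)
      .collect()

    println(s"total errors: $total, distinct codes: ${byCode.length}")
    spark.stop()
  }
}
```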

Using SparkSQL and DataFrames
An introduction to the use of SparkSQL and DataFrames, including:

  • Spark sessions and SparkSQL contexts
  • DataFrames and how they relate to RDDs and Datasets
  • SparkSQL’s use of the Hive Metastore
  • File formats available for use with SparkSQL
  • Interleaving of Spark and SparkSQL actions with Scala programs
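
A brief sketch of how DataFrames, SparkSQL and RDDs interleave in a Scala program; the path, view and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets SparkSQL resolve tables registered in the Hive Metastore.
    val spark = SparkSession.builder()
      .appName("sparksql-dataframes")
      .enableHiveSupport()
      .getOrCreate()

    // DataFrames can be built from files (Parquet, ORC, JSON, CSV, ...) ...
    val orders = spark.read.parquet("/data/curated/orders")   // illustrative path

    // ... or registered as views and queried with either the DSL or SQL.
    orders.createOrReplaceTempView("orders")
    val byCustomer = spark.sql(
      "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")

    val topTen = byCustomer.orderBy(desc("total")).limit(10)
    topTen.show()

    // A DataFrame is an untyped Dataset[Row]; .rdd exposes the underlying RDD.
    println(topTen.rdd.getNumPartitions)

    spark.stop()
  }
}
```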

Day 3 – Additional Hadoop Tools

Data Lake Technology Using Apache NiFi and Kylo
An introduction to data lake architecture and technologies, including:

  • What a data lake is and how data lakes are used
  • The roles of data governance, provenance and metadata in an effective data lake
  • Overview of data ingestion technologies
  • Apache NiFi overview, capabilities and hands-on use
  • Kylo overview, capabilities and hands-on use
  • Summary of other open-source frameworks

Using Presto
An introduction to federated SQL operations using Presto on Hadoop clusters, including:

  • Presto architecture and metadata
  • How to use Catalogs and Schemas
  • Performing Joins across data sets on different systems (federated queries)
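
Presto is normally queried from its CLI or a BI tool, but any JDBC client will do. The sketch below assumes the Presto JDBC driver is on the classpath and that hive and mysql catalogs have been configured on the coordinator; the host, port, schemas, tables and columns are all invented for illustration.

```scala
import java.sql.DriverManager

object PrestoFederatedQuery {
  def main(args: Array[String]): Unit = {
    // JDBC URL format: jdbc:presto://<coordinator>:<port>/<catalog>/<schema>
    // Host, port, catalogs, schemas and table names below are illustrative.
    val conn = DriverManager.getConnection(
      "jdbc:presto://coordinator:8080/hive/default", "analyst", null)
    try {
      // A federated join: one table lives in Hive, the other in MySQL.
      val rs = conn.createStatement().executeQuery(
        """SELECT c.region, SUM(o.amount) AS revenue
          |FROM hive.sales.orders o
          |JOIN mysql.crm.customers c ON o.customer_id = c.id
          |GROUP BY c.region
          |ORDER BY revenue DESC""".stripMargin)
      while (rs.next()) {
        println(s"${rs.getString("region")}\t${rs.getDouble("revenue")}")
      }
    } finally {
      conn.close()
    }
  }
}
```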

Using HBase
An introduction to the concepts of NoSQL databases and HBase, including:

  • Differences between NoSQL and relational databases
  • How HBase works with Hadoop infrastructure
  • Physical representations of HBase databases on disk
  • Files, Regions, Splits and Compactions
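
A minimal sketch of the HBase client API from Scala, assuming an hbase-site.xml on the classpath and an existing table; the table, column family and row key shown are invented for illustration.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSketch {
  def main(args: Array[String]): Unit = {
    // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    try {
      // Table and column family names are illustrative; the table must already exist.
      val table = connection.getTable(TableName.valueOf("web_sessions"))

      // Write one cell: row key -> column family:qualifier -> value.
      val put = new Put(Bytes.toBytes("user42#2017-06-26"))
      put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("page"), Bytes.toBytes("/checkout"))
      table.put(put)

      // Read it back by row key.
      val result = table.get(new Get(Bytes.toBytes("user42#2017-06-26")))
      val page = Bytes.toString(result.getValue(Bytes.toBytes("s"), Bytes.toBytes("page")))
      println(s"page = $page")

      table.close()
    } finally {
      connection.close()
    }
  }
}
```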

Using Pig
An introduction to using Pig and data flow programming, including:

  • Data flow concepts and Schema on Read in Pig
  • Maps, Bags, tuples and complex data structures
  • Joining strategies
  • GroupBy, Projections and Flattening
  • Pig UDFs and file formats
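
Pig Latin scripts are normally run through the grunt shell or the pig command; to keep this outline's sketches in Scala, the snippet below drives equivalent statements through Pig's embedded PigServer API in local mode. Paths, aliases and field names are invented for illustration.

```scala
import org.apache.pig.{ExecType, PigServer}

object PigSketch {
  def main(args: Array[String]): Unit = {
    // Local mode for quick experiments; ExecType.MAPREDUCE runs on the cluster.
    val pig = new PigServer(ExecType.LOCAL)

    // Load with a schema on read, group, then project counts per status code.
    pig.registerQuery("logs = LOAD '/data/raw/web_logs' USING PigStorage('\\t') " +
      "AS (ip:chararray, url:chararray, status:int);")
    pig.registerQuery("by_status = GROUP logs BY status;")
    pig.registerQuery("hits = FOREACH by_status GENERATE group AS status, COUNT(logs) AS n;")

    pig.store("hits", "/data/out/hits_by_status")
    pig.shutdown()
  }
}
```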

Format

50% Lecture/Discussion
50% Hands-on Labs

Additional Information

The course content can be customised to cover any specialised material you may require for your specific training needs.

This course can be offered as private on-site training hosted at your offices. For more information, please contact us at [email protected]

Related Training Courses

Apache Cassandra – This is a fast-paced, vendor-agnostic technical Apache Cassandra course that focuses on the key aspects of the technology for developers and system operations staff, covering core internal and distributed architecture fundamentals.

HDP Analyst: Apache HBase Essentials – This 2-day workshop introduces HBase basics, structure and operations in an intensely hands-on experience.

Apache Hadoop Essentials – This course is designed to help attendees understand the concepts and benefits of Apache Hadoop and how it can help them meet their business goals.

Machine Learning with Apache Hadoop – This course is designed to help attendees understand the high-level concepts and classifications of machine learning systems with a strong focus on building Recommender Systems.

Upcoming Dates

26 – 28 June: Hadoop Developer, London, UK, 9:00 am – 5:00 pm GMT (Teradata UK)