This advanced course provides Java programmers with a deep dive into Hadoop application development. Attendees will learn how to design and develop efficient and effective MapReduce applications for Hadoop using the Hortonworks Data Platform, including how to implement combiners, partitioners, secondary sorts, custom input and output formats, joins of large datasets, unit testing, and UDFs for Pig and Hive. Labs are run on a 7-node HDP 2.1 cluster running in a virtual machine that attendees can keep for use after the training.
Who this course is for
Experienced Java software engineers who need to develop Java MapReduce applications for Hadoop.
Attendees must have experience developing Java applications and using a Java IDE. Labs are completed using the Eclipse IDE and Gradle. No prior Hadoop knowledge is required.
What you will learn
- Describe Hadoop 2 and the Hadoop Distributed File System
- Describe the YARN framework
- Develop and run a Java MapReduce application on YARN
- Use combiners and in-map aggregation
- Write a custom partitioner to avoid data skew on reducers
- Perform a secondary sort
- Recognize use cases for built-in input and output formats
- Write a custom MapReduce input and output format
- Optimize a MapReduce job
- Configure MapReduce to optimize mappers and reducers
- Develop a custom RawComparator class
- Distribute files as LocalResources
- Describe and perform join techniques in Hadoop
- Perform unit tests using the MRUnit API
- Describe the basic architecture of HBase
- Write an HBase MapReduce application
- List use cases for Pig and Hive
- Write a simple Pig script to explore and transform big data
- Write a Pig UDF (User-Defined Function) in Java
- Write a Hive UDF in Java
- Use the JobControl class to create a MapReduce workflow
- Use Oozie to define and schedule workflows
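Several of the objectives above (combiners, in-map aggregation) rest on one idea: pre-aggregate values inside the mapper so fewer pairs cross the shuffle. A minimal plain-Java sketch of the in-mapper aggregation pattern follows; Hadoop dependencies are omitted and the class and method names are illustrative, not part of any Hadoop API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of in-mapper aggregation: instead of emitting ("word", 1) for every
// token, the mapper accumulates partial counts in a local map and emits each
// word once per input split, cutting shuffle traffic.
public class InMapperAggregation {

    // Simulates what map() would build up across calls; in a real Hadoop
    // Mapper the map would be an instance field, flushed in cleanup().
    public static Map<String, Integer> aggregate(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("\\s+")) {
                if (token.isEmpty()) continue;
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts; // each (word, partialCount) pair would be emitted once
    }

    public static void main(String[] args) {
        String[] lines = {"the quick fox", "the lazy dog"};
        System.out.println(aggregate(lines).get("the")); // prints 2
    }
}
```

A combiner achieves a similar reduction after the map phase; the in-mapper variant trades mapper memory for even less intermediate data.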
Hands-On Labs
- Configure a Hadoop development environment
- Put data into HDFS using Java
- Write a distributed grep MapReduce application
- Write an inverted index MapReduce application
- Configure and use a combiner
- Write custom combiners and partitioners
- Globally sort output using the TotalOrderPartitioner
- Write a MapReduce job to sort data using a composite key
- Write a custom InputFormat class
- Write a custom OutputFormat class
- Compute a simple moving average of stock price data
- Use data compression
- Define a RawComparator
- Perform a map-side join
- Use a Bloom filter
- Unit test a MapReduce job
- Import data into HBase
- Write an HBase MapReduce job
- Write user-defined Pig and Hive functions
- Define an Oozie workflow
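The composite-key sort lab above sorts by one field while grouping by another, which is how a secondary sort is usually built. The comparison logic can be sketched in self-contained Java as below; in a real job the key would implement WritableComparable and be paired with a grouping comparator, and the names here are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Composite key: a natural key (symbol) plus a secondary field (timestamp).
// Comparison uses the natural key first, then the secondary field, so all
// records for a symbol reach the reducer in timestamp order.
public class StockKey implements Comparable<StockKey> {
    final String symbol;
    final long timestamp;

    StockKey(String symbol, long timestamp) {
        this.symbol = symbol;
        this.timestamp = timestamp;
    }

    @Override
    public int compareTo(StockKey other) {
        int bySymbol = symbol.compareTo(other.symbol);
        if (bySymbol != 0) return bySymbol;              // primary: group by symbol
        return Long.compare(timestamp, other.timestamp); // secondary: order by time
    }

    public static void main(String[] args) {
        List<StockKey> keys = new ArrayList<>();
        keys.add(new StockKey("IBM", 300));
        keys.add(new StockKey("AAPL", 200));
        keys.add(new StockKey("IBM", 100));
        keys.sort(null); // natural ordering via compareTo
        // AAPL/200 sorts first, then IBM/100, then IBM/300
    }
}
```

In MapReduce, a custom partitioner and grouping comparator would ensure all keys sharing a symbol go to the same reduce call despite differing timestamps.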
Course Format
- 50% Lecture/Discussion
- 50% Hands-on Labs
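One of the labs above computes a simple moving average of stock price data. The windowed arithmetic at its core can be sketched in plain Java as follows; the MapReduce version would partition the work by stock symbol, but the per-window computation is the same, and this class is illustrative rather than taken from the course materials:

```java
// Simple moving average: the mean of the last `window` prices at each step.
// For positions before a full window exists, the average of all prices so
// far is returned.
public class MovingAverage {

    public static double[] simpleMovingAverage(double[] prices, int window) {
        double[] result = new double[prices.length];
        double sum = 0.0;
        for (int i = 0; i < prices.length; i++) {
            sum += prices[i];
            if (i >= window) {
                sum -= prices[i - window]; // slide the window forward
            }
            int count = Math.min(i + 1, window);
            result[i] = sum / count;
        }
        return result;
    }
}
```

For example, prices {1, 2, 3, 4} with a window of 2 yield {1.0, 1.5, 2.5, 3.5}.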
Related Training Courses
HDP Developer: Apache Pig and Hive. This 4-day hands-on training course teaches attendees how to develop applications and analyse Big Data stored in Apache Hadoop 2.x using Pig and Hive.
HDP Operations: Hadoop Administration 1. This 4-day course is designed for Hortonworks Data Platform administrators, and covers installation, configuration, maintenance, security and performance topics.
HDP Operations: Hadoop Administration 2. This 3-day course is designed for experienced administrators who manage Hortonworks Data Platform (HDP) 2.3 clusters with Ambari.
HDP Administrator: Security. This 3-day course is designed for experienced administrators who will be implementing secure Hadoop clusters using authentication, authorisation, auditing and data protection strategies and tools.
HDP Analyst: Data Science. This 3-day course provides instruction on the processes and practice of data science, including machine learning and natural language processing.
HDP Operations: Hortonworks Data Flow. This 3-day course is designed for ‘Data Stewards’ or ‘Data Flow Managers’ who want to automate the flow of data between systems.
HDP Analyst: Apache HBase Essentials. This 2-day workshop introduces HBase basics, structure and operations in an intensely hands-on experience.
HDP Operations: Apache HBase Advanced Management. This 4-day course is designed for administrators who will be installing, configuring and managing HBase clusters.
HDP Developer: Enterprise Spark 1. This 4-day course is designed as an entry point for developers who need to create applications to analyse Big Data stored in Apache Hadoop using Spark.