This course is designed as an entry point for developers who need to create applications to analyse big data stored in Apache Hadoop using Spark. Topics include: an overview of the Hortonworks Data Platform (HDP), including HDFS and YARN; using Spark Core APIs for interactive data exploration; Spark SQL and DataFrame operations; Spark Streaming and DStream operations; data visualisation, reporting, and collaboration; performance monitoring and tuning; building and deploying Spark applications; and an introduction to the Spark Machine Learning Library.
Who is the course for
Software engineers who want to develop in-memory applications for time-sensitive, highly iterative workloads in an enterprise HDP environment.
Students should be familiar with programming principles and have previous experience in software development using either Python or Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not required.
What you will learn
- Describe Hadoop, HDFS, YARN, and the HDP ecosystem
- Describe Spark use cases
- Explore and manipulate data using Zeppelin
- Explore and manipulate data using Spark REPL
- Explain the purpose and function of RDDs
- Employ functional programming practices
- Perform Spark transformations and actions
- Work with Pair RDDs
- Perform streaming queries using Spark Streaming stateless and window transformations
- Visualise data, generate reports, and collaborate using Zeppelin
- Monitor Spark applications using Spark History Server
- Apply general application optimisation guidelines and tips
- Use data caching to increase performance of applications
- Build and package Spark applications
- Deploy applications to the cluster using YARN
- Understand the purpose of Spark MLlib
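As a preview of several of the objectives above (RDD transformations and actions, Pair RDDs, and functional programming practices), the word-count pattern can be sketched with plain Python built-ins before moving to Spark. The helpers below are ordinary Python, not Spark APIs; in PySpark the analogous calls would be `flatMap`, `map`, `reduceByKey`, and `collect` on an RDD.

```python
from collections import defaultdict

lines = ["spark makes big data simple", "big data needs spark"]

# Transformation-style steps: describe how to reshape the data.
words = [w for line in lines for w in line.split()]  # ~ flatMap: line -> words
pairs = [(w, 1) for w in words]                      # ~ map: word -> (key, value) pair

# Action-style step: aggregate the (key, value) pairs, as reduceByKey would.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(sorted(counts.items()))
# e.g. [('big', 2), ('data', 2), ('makes', 1), ('needs', 1), ('simple', 1), ('spark', 2)]
```

In Spark, the transformation steps are lazy and build a lineage graph; nothing executes until an action (such as `collect` or `count`) is invoked.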
Hands-on Lab Activities
Labs can be performed using either Python or Scala
- Using common HDFS commands
- Use a REPL to program in Spark
- Use Zeppelin to program in Spark
- Perform RDD transformations using Spark Streaming
- Perform window-based transformations
- Use Zeppelin for data visualisation and reporting
- Monitor applications using Spark History Server
- Cache and persist data
- Configure checkpointing, broadcast variables, and executors
- Build and submit a Spark application to YARN
- Run Spark MLlib applications
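For the build-and-submit lab, a typical YARN submission looks something like the following. The application file name and resource sizes are illustrative placeholders; the flags themselves are standard `spark-submit` options.

```shell
# Submit a PySpark application to a YARN cluster in cluster deploy mode.
# File name and resource sizes are placeholders -- adjust for your cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my_app.py
```

In cluster deploy mode the driver runs inside the YARN application master; in client mode it runs on the submitting machine, which is often more convenient for interactive debugging.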
Course Format
- 50% Lecture/Discussion
- 50% Hands-on Labs
Related Training Courses
HDP Developer: Apache Pig and Hive This 4-day hands-on training course teaches attendees how to develop applications and analyse big data stored in Apache Hadoop 2.x using Pig and Hive.
HDP Operations: Hadoop Administration 1 This 4-day course is designed for Hortonworks Data Platform administrators, and covers installation, configuration, maintenance, security and performance topics.
HDP Administrator: Security This 3-day course is designed for experienced administrators who will be implementing secure Hadoop clusters using authentication, authorisation, auditing and data protection strategies and tools.
HDP Analyst: Data Science This 3-day course provides instruction on the processes and practice of data science, including machine learning and natural language processing.
HDP Operations: Hortonworks Data Flow This 3-day course is designed for ‘Data Stewards’ or ‘Data Flow Managers’ who want to automate the flow of data between systems.
HDP Analyst: Apache HBase Essentials This 2-day workshop introduces HBase basics, structure and operations in an intensely hands-on experience.
HDP Operations: Apache HBase Advanced Management This 4-day course is designed for administrators who will be installing, configuring and managing HBase clusters.