This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.
Who is the course for
Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Hadoop.
Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.
What you will learn
- Recognise use cases for data science on Hadoop
- Describe the Hadoop and YARN architecture
- Describe the differences between supervised and unsupervised learning
- Use Mahout to run a machine learning algorithm on Hadoop
- Describe the data science life cycle
- Use Pig to transform and prepare data on Hadoop
- Write a Python script
- Describe options for running Python code on a Hadoop cluster
- Write a Pig User-Defined Function in Python
- Use Pig streaming on Hadoop with a Python script
- Use machine learning algorithms
- Describe use cases for Natural Language Processing (NLP)
- Use the Natural Language Toolkit (NLTK)
- Describe the components of a Spark application
- Write a Spark application in Python
- Run machine learning algorithms using Spark MLlib
- Take data science into production
Labs and demos
- Labs: Setting Up a Development Environment
- Demo: Block Storage
- Labs: Using HDFS Commands
- Demo: MapReduce
- Lab: Using Apache Mahout for Machine Learning
- Demo: Apache Pig
- Lab: Getting Started with Apache Pig
- Lab: Exploring Data with Pig
- Lab: Using the IPython Notebook
- Demo: The NumPy Package
- Demo: The pandas Library
- Lab: Data Analysis with Python
- Lab: Interpolating Data Points
- Lab: Defining a Pig UDF in Python
- Lab: Streaming a Python Script with Pig
- Demo: Classification with scikit-learn
- Lab: Computing K-Nearest Neighbours
- Lab: Generating a K-Means Clustering
- Lab: POS Tagging Using a Decision Tree
- Lab: Using NLTK for Natural Language Processing
- Lab: Classifying Text using Naive Bayes
- Lab: Using Spark Transformations and Actions
- Lab: Using Spark MLlib
- Lab: Creating a Spam Classifier with MLlib
Format
- 50% Lecture/Discussion
- 50% Hands-on Labs
Related Training Courses
HDP Developer: Apache Pig and Hive
This 4-day hands-on training course teaches attendees how to develop applications and analyse Big Data stored in Apache Hadoop 2.x using Pig and Hive.
HDP Operations: Hadoop Administration 1
This 4-day course is designed for Hortonworks Data Platform administrators, and covers installation, configuration, maintenance, security and performance topics.
HDP Administrator: Security
This 3-day course is designed for experienced administrators who will be implementing secure Hadoop clusters using authentication, authorisation, auditing and data protection strategies and tools.
HDP Operations: Hortonworks Data Flow
This 3-day course is designed for ‘Data Stewards’ or ‘Data Flow Managers’ who want to automate the flow of data between systems.
HDP Analyst: Apache HBase Essentials
This 2-day workshop introduces HBase basics, structure and operations in an intensely hands-on experience.
HDP Operations: Apache HBase Advanced Management
This 4-day course is designed for administrators who will be installing, configuring and managing HBase clusters.
HDP Developer: Enterprise Spark 1
This 4-day course is designed as an entry point for developers who need to create applications to analyse big data stored in Apache Hadoop using Spark.