Databricks: Spark Essentials

Overview

This overview course is a guided hands-on tour of Apache Spark, a popular engine for big data analytics with a unified API for batch analytics, SQL queries, stream processing, machine learning, and graph analysis. The course walks attendees through the lifecycle of a Spark application, from Extract-Transform-Load (ETL) operations, through ad-hoc data analysis and SQL queries, to machine learning and beyond. Attendees will learn to use a variety of tools to understand how the entire Spark stack functions, from the underlying execution engine to the fundamental programming abstractions (e.g. Resilient Distributed Datasets and DataFrames).

Duration

1 day

Who is this course for?

Engineers, Data Scientists, and Analysts

Prerequisites

Students should arrive to class with:

  • A basic understanding of software development
  • Some experience coding in Python, Java, SQL, Scala, or R
  • A modern operating system (Windows, OS X, or Linux)
  • An up-to-date version of Chrome or Firefox (Internet Explorer not supported) and Internet access

What you will learn

Overview of Apache Spark and Databricks

  • A brief history of Spark and Databricks
  • Where Spark fits in the big data landscape
  • Apache Spark vs. Hadoop MapReduce: An architecture comparison

Intro to DataFrames and Spark SQL

  • What are DataFrames?
  • DataFrames and Spark SQL
  • Using SQLContext
  • Creating your first DataFrame (see the sketch after this list)
  • Inspecting your DataFrame (e.g. printSchema(), describe(), show(), take())
  • Running DataFrame operations
  • Reading from multiple data source formats
  • Using the table catalog
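
For a taste of these topics, here is a minimal PySpark sketch. It assumes the sqlContext variable that Databricks notebooks provide, and the file path is purely hypothetical:

    # Create a DataFrame from a JSON data source (path is hypothetical)
    df = sqlContext.read.json("/data/people.json")

    # Inspect the DataFrame
    df.printSchema()             # column names and types
    df.describe("age").show()    # summary statistics for a numeric column
    df.show(5)                   # first 5 rows, formatted
    df.take(5)                   # first 5 rows as a list of Row objects

    # Run DataFrame operations: filter, group, aggregate
    from pyspark.sql import functions as F
    df.filter(df["age"] > 21).groupBy("city").agg(F.avg("age")).show()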

Resilient Distributed Datasets: Fundamentals

  • RDDs vs. DataFrames
  • Two ways to create an RDD: parallelising a local collection or reading from an external data source (see the sketch after this list)
  • How an RDD is distributed via partitions in a cluster
  • Introduction to Transformations and Actions
  • Different types of RDDs
  • How transformations lazily build up a Directed Acyclic Graph (DAG)
  • Introduction to Caching an RDD
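
A minimal sketch of these fundamentals, assuming the sc (SparkContext) variable that a Databricks notebook provides and a hypothetical file path:

    # Two ways to create an RDD
    rdd = sc.parallelize(range(1, 1001))        # from a local collection
    lines = sc.textFile("/data/sample.txt")     # from an external data source (hypothetical path)

    # Transformations are lazy: each one only adds a step to the DAG
    squares = rdd.map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)

    # Caching marks the RDD to be kept in memory after its first computation
    evens.cache()

    # Actions trigger actual execution of the DAG
    print(evens.count())
    print(evens.take(5))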

Spark Documentation and Resources

  • Spark Guide
  • Spark API Documentation
  • Spark Source Code on GitHub
  • Discussion Forums
  • Videos, Courses, and Other Resources

Spark Runtime Architecture

  • How the JVM processes in a Spark cluster interact: Driver, Executor, Worker, Spark Master
  • RDDs, DAGs, and Narrow vs. Wide Operations
  • How jobs are broken into stages and tasks and scheduled for execution (see the sketch after this list)
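
The stage boundary can be seen directly in an RDD's lineage. In this small sketch (again assuming a notebook-provided sc), map is a narrow operation that stays within a stage, while reduceByKey is a wide operation that shuffles data and starts a new stage:

    pairs = sc.parallelize(["a", "b", "a", "c"]).map(lambda w: (w, 1))  # narrow: no shuffle
    counts = pairs.reduceByKey(lambda a, b: a + b)                      # wide: shuffles by key

    # The lineage shows the shuffle boundary where the job splits into stages
    print(counts.toDebugString())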

More on Spark SQL and DataFrames

  • Creating a temporary table from a data source (using a DataFrame)
  • Overview of supported SQL dialect
  • Querying the temporary table with SQL (see the sketch after this list)
  • Using the table catalog
  • Table and DataFrame caching
  • Understanding query plans (.explain(true))
  • Working with nested data
  • Statistical functions in DataFrames
  • Working with null data
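
A minimal sketch tying these together, assuming sqlContext and a DataFrame df with age and city columns like the one created earlier:

    # Register the DataFrame as a temporary table, then query it with SQL
    df.registerTempTable("people")
    adults = sqlContext.sql("SELECT city, COUNT(*) AS n FROM people WHERE age > 21 GROUP BY city")

    # Cache the table so repeated queries avoid re-reading the source
    sqlContext.cacheTable("people")

    # Inspect the parsed, analysed, optimised, and physical query plans
    adults.explain(True)

    # Working with null data: drop rows with nulls, or fill them in
    df.na.drop(subset=["age"]).show()
    df.na.fill({"age": 0}).show()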

Spark Streaming

  • Understanding the Streaming Architecture: How DStreams break down into RDD batches
  • How receivers run inside Executor task slots to capture data coming in from a network socket, Kafka, or Flume
  • Common transformations and actions on DStreams (see the sketch after this list)
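
As an illustration of the DStream model, here is a minimal streaming word count over a network socket, assuming a notebook-provided sc; the host and port are placeholders:

    from pyspark.streaming import StreamingContext

    # Each batch interval (here 10 seconds) produces one RDD in the DStream
    ssc = StreamingContext(sc, 10)
    lines = ssc.socketTextStream("localhost", 9999)  # placeholder host and port

    # Common transformations on a DStream, then an output action per batch
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()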

Machine Learning

  • Supervised Learning
  • Unsupervised Learning (a toy example of each follows this list)
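
The outline does not name specific algorithms here, so the following MLlib sketch is purely illustrative, with toy data and a notebook-provided sc:

    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.regression import LabeledPoint

    # Supervised learning: learn to predict a label from features
    labeled = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                              LabeledPoint(1.0, [1.0, 0.0])])
    classifier = LogisticRegressionWithLBFGS.train(labeled)

    # Unsupervised learning: find structure in unlabelled points
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])
    clusters = KMeans.train(points, k=2)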

Learning Objectives

After taking this class you will be able to:

  • Experiment with use cases for Spark and Databricks, including extract-transform-load operations, data analytics, data visualisation, batch analysis, machine learning, graph processing, and stream processing
  • Identify Spark and Databricks capabilities appropriate to your business needs
  • Communicate with your team members and engineers using appropriate terminology
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Execute and modify extract-transform-load (ETL) jobs to process Big Data using the Spark API, DataFrames, and Resilient Distributed Datasets (RDDs)
  • Analyse Spark jobs using the administration UIs and logs inside Databricks
  • Find answers to common Spark and Databricks questions using the documentation and other resources

Hands-on Labs

Spark Guided Tour

  • Connecting to the Databricks notebook lab environment
  • DataFrames
  • Spark SQL
  • Visualisations
  • Transformations
  • Machine Learning
  • Exercise: What can Spark do for your team?

Using DataFrames

  • Examples of the DataFrames API to query and transform data

A Developer’s Introduction to Spark

  • Learn what a SparkContext is and how to use it
  • Using the Spark shell to parallelise data from a local collection and perform transformations and actions
  • Observing the execution of Spark jobs visually in the Spark UI
  • Caching an RDD
  • How to repartition an RDD and count the number of items in each partition
  • Understanding the Spark lineage graph with .toDebugString() (see the sketch after this list)
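
A minimal sketch of the partitioning and lineage exercises, assuming a notebook-provided sc:

    # Parallelise a local collection into 4 partitions, then repartition to 8
    rdd = sc.parallelize(range(100), 4)
    repartitioned = rdd.repartition(8).cache()

    # Count the number of items in each partition
    sizes = repartitioned.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    print(sizes)

    # Inspect the lineage graph; the shuffle from repartition() appears here
    print(repartitioned.toDebugString())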

Searching the Docs

  • Given a task, locate the appropriate API documentation

Spark UI

  • Visualising jobs and DAGs
  • Monitoring tasks and stages
  • Reading logs

Spark SQL

  • Examples of SQL queries, plus some assignments

Format

  • 50% Lecture
  • 50% Labs

Related Training Courses

Databricks: Spark Development Bootcamp

On this 3-day course, developed by Databricks, you will learn how to build and manage Spark applications using Spark’s core programming APIs and its standard libraries.

Upcoming Sessions

22 May, 9:00 am - 5:00 pm GMT: London, UK (Teradata UK)