Databricks: Spark Programming (Spark 105)


Overview

This 3-day hands-on workshop introduces you to Apache Spark through coding exercises and lectures. Spark is a unified framework for big data analytics. It provides one integrated API that developers, data scientists, and analysts can use to perform diverse tasks, such as batch analytics, stream processing, and statistical modeling, that previously required separate processing engines. Spark supports a wide range of popular languages, including Python, R, Scala, SQL, and Java. It can read from diverse data sources and scale to thousands of nodes.

In this class, you will learn how to build and manage Spark applications using Spark’s core programming APIs and its standard libraries. You will receive a free Databricks account for the duration of the training.

Duration

3 days

Who is the course for

Engineers, Data Scientists, and Analysts

Prerequisites

Students should arrive at class with:

  • A basic understanding of software development
  • Some experience coding in Python, Java, SQL, or Scala
  • A laptop with a modern operating system (Windows, OS X, Linux), browser (Internet Explorer not supported), and Internet access

What you will learn

After taking this class you will be able to:

  • Build a data pipeline using Spark DataFrames and Spark SQL
  • Understand Spark concepts, architecture, and applications
  • Execute SQL queries on large-scale data using Spark
  • Explore and visualize your data by entering and running code in Notebooks
  • Train and use an ML model on real data with MLlib, Spark’s machine learning library
  • Tune Spark job performance and troubleshoot errors using logs and administration UIs
  • Find answers to common questions using Spark documentation and discussion forums
  • Write and monitor a Spark Streaming job to analyze data with sub-second latency
  • Understand common use cases and business applications of Spark

Course Outline

Day 1

  • History of Big Data & Apache Spark
  • Databricks Overview
  • Spark Capabilities and Ecosystem
  • Basic Spark Components
  • Spark SQL and DataFrame Uses
  • DataFrame / SQL APIs
  • Catalyst Query Optimization
  • ETL

Day 2

  • Data Sources: reading from Parquet, S3, Cassandra, HDFS, and your local file system
  • Memory & Persistence
  • Jobs, Stages and Tasks
  • Partitions and Shuffling
  • Data Locality
  • Spark’s Architecture

Day 3

  • Structured Streaming APIs
  • Windowing
  • Checkpointing and watermarking
  • Streaming DataFrames
  • Reliability and fault tolerance in Spark Streaming
  • Spark MLlib Pipeline API
  • Built-in featurizing and algorithms
  • Basic graph analysis
  • GraphFrames API
  • GraphFrames motif finding
  • Persisting graph data

Format

  • 50% Lecture
  • 50% Labs

Related Training Courses

Databricks: Spark Overview (Spark 100) This 1-day overview course is a guided hands-on tour of Spark, a popular tool for Big Data analytics with a unified API for batch analytics, SQL queries, stream processing, machine learning, and graph analysis.

Databricks: Apache Spark for Machine Learning & Data Science (Spark 301) This hands-on Apache Spark training targets experienced Data Scientists wishing to perform data analysis at scale using Apache Spark.

November

  • 28 - 30 November, 9:00 am - 5:00 pm GMT: Databricks: Spark Programming (Spark 105), London, UK, Teradata UK (Sold Out!)

December

  • 12 - 14 December, 9:00 am - 5:00 pm GMT: Databricks: Spark Programming (Spark 105), London, UK, Teradata UK

January

  • 30 January - 1 February, 9:00 am - 5:00 pm GMT: Databricks: Spark Programming (Spark 105), London, UK, Teradata UK

April

  • 24 - 26 April, 9:00 am - 5:00 pm GMT: Databricks: Spark Programming (Spark 105), London, UK, Teradata UK
