Introduction to Data Science in the Big Data World


This course provides an introduction to the methods and tools of Data Science in a big data context. Delegates will be introduced to a variety of machine learning algorithms and shown how to apply them to practical examples of real-world problems.


3 days



What you will learn

  • An overview of the field of Data Science, with example use cases, and the skills required by its practitioners
  • An overview of big data and an introduction to the concepts of distributed computing
  • An introduction to Python, the language of choice for many Data Scientists, including basic functionality and the most useful libraries for analysing and manipulating data
  • An introduction to machine learning describing the different general approaches one can take when building a model
  • Explanation and implementation, by way of example, of several of the most widely used machine learning algorithms
  • Introduction to graph analysis

Course Outline

Day 1:

Data Science in a Big Data world

  • Course overview and introductions
  • What is big data? – The four Vs
  • Big data in action
  • Distributed computing
  • Databases – SQL vs. NoSQL
  • Data Science – What is Data Science?, What is a Data Scientist?, Data Science tools, Use Cases
  • Data Protection and governance

Python Programming

  • Why Python?
  • Demo 1: Data types and functions
  • Demo 2: Data analysis
  • Lab 1: Data type manipulation and functions
  • Lab 2: Data analysis

Basic Statistics

  • Summarising data
  • Data distributions
  • Confidence intervals
  • Correlations and similarity measures
  • Simpson’s paradox
  • Demo: Exploring UK weather data
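As a flavour of the confidence interval topic listed above, the following is a minimal sketch rather than course material. It assumes a normal approximation for a 95% interval around a sample mean. The function name `mean_confidence_interval` and the sample values are hypothetical and chosen only for illustration.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * standard_error, mean + z * standard_error


# Made-up measurements, purely illustrative (not real weather data).
sample = [12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9]
low, high = mean_confidence_interval(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

For small samples a t-distribution multiplier would usually be preferred over the fixed z value used here.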


Day 2:

Introduction to machine learning

  • What is machine learning?
  • Types of machine learning approaches – Supervised learning and unsupervised learning

Supervised learning

  • Predicting the airspeed velocity of an unladen swallow with linear regression – Linear regression explained, Under- and over-fitting, Cost function, Gradient descent, Demo
  • Spam detection using natural language processing and logistic regression – What is classification?, Logistic regression explained, Data preparation, Feature construction, Cost function, Training, testing and validation, Assessing model performance, The accuracy fallacy, Demo
  • Other supervised learning methods – k-nearest neighbours, Support vector machines, Support vector regression, Naive Bayes classification
  • Scaling supervised learning
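To illustrate how the cost function and gradient descent topics above fit together, here is a minimal sketch of single-variable linear regression. It is not the course demo: the function name `fit_linear_regression`, the learning rate, the epoch count and the toy data are all assumptions made for this example.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the cost J = (1 / 2n) * sum(error ** 2).
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient to reduce the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_regression(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

On noise-free data like this the fitted slope and intercept should approach 2 and 1. A learning rate that is too large would make the updates diverge instead.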


Day 3:

Unsupervised learning

  • Grouping iris varieties using k-means clustering and principal component analysis – What is clustering?, Cluster analysis using k-means, Other types of clustering, Demo 1: k-means clustering, Dimensionality reduction using principal component analysis, Feature scaling, Demo 2: principal component analysis
  • Scaling k-means clustering
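The k-means topic above can be sketched in a few lines of plain Python. This is an illustrative outline only, assuming Lloyd's algorithm with a naive initialisation (the first k points become the starting centroids). The function name `k_means` and the sample points are hypothetical and are not the iris data used in the course.

```python
import math


def k_means(points, k, iterations=20):
    """Cluster points with Lloyd's algorithm; returns (centroids, labels)."""
    # Naive initialisation (an assumption): first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
                )
            else:
                # Keep the old centroid if its cluster is empty.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # Converged: centroids no longer move.
        centroids = new_centroids
    labels = [
        min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
    ]
    return centroids, labels


# Two well-separated groups of made-up 2D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

In practice the result depends on the initial centroids, which is why library implementations typically use smarter seeding and several restarts.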

Building a recommender system

  • Recommender system explanation
  • Types of recommender system
  • Examples of recommender systems
  • Building a movie recommender
  • Improving your recommender – Dithering, Cross-recommendation
  • Demo: Building a movie recommendation system

Social network analysis using graph theory

  • What is a graph?
  • Social networks
  • Demo: Social network analysis using Gephi


The course includes a variety of instructor demos and guided hands-on exercises for delegates to work through, to aid understanding and appreciation of the topics. There will be plenty of discussion and interaction.

Additional Information

The course content can be customised to cover any specialised material you may require for your specific training needs.

This course can be offered as private on-site training hosted at your offices. For more information, please contact us at [email protected]