Introduction to Big Data

Lessons and Labs created by Rémi Pépin and Arthur Katossky

Objectives

  • Understand the basics of computation in the real world, its bottlenecks, and how to work around them
  • Understand the basics of cloud computing and how to use AWS (or SSP Cloud)
  • Get familiar with big data technologies and the most common paradigms
  • Learn how to use Spark to explore data at rest or streaming data, and how to run some basic ML algorithms on big data

Organisation

  • Lessons: 7h30
  • Labs: 10h30 + 3h (graded lab)
  • Presentations: 3h

Lessons

1.1 What is Big Data?

  • What is considered big?
  • Volume, Velocity, Variety, Veracity, Value, Variability

1.2 Computer science survival kit

  • Processors, Memory, Storage, Network

1.3 High-performance computing without distribution

  • Profile code, Analyse code
  • Store or process data in chunks, Take advantage of sparsity (see the sketch after this list)
  • Go low-level
  • Use cache
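
To make a couple of these concrete, here is a minimal Python sketch of chunked processing (with pandas) and of exploiting sparsity (with scipy). It is illustrative only: the file `big_file.csv` and its `amount` column are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Chunked processing: chunksize makes read_csv yield DataFrames of
# 100,000 rows at a time instead of loading the whole file in memory.
total = 0.0
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print("sum of 'amount':", total)

# Sparsity: a mostly-zero matrix stored densely wastes memory;
# a CSR sparse matrix stores only the non-zero entries.
dense = np.zeros((1_000, 1_000))
dense[0, 0] = 1.0
compressed = sparse.csr_matrix(dense)
print(compressed.nnz, "non-zero value stored instead of", dense.size)
```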

1.4 What if?

  • What if data is too big to fit in memory?
  • What if your file is too big for your local file system?
  • What if data is too big to fit on disk?
  • What if computation takes ages?
  • What if computation/storage is too expensive?

1.5 Social issues

  • Ethical issues, Environmental issues, Political issues

2.1 How to store big data

  • File system, Database, Distribution
  • The CAP theorem

2.2 Hadoop file system (HDFS)

  • Hadoop Ecosystem, How to use HDFS (see the sketch below)
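
For instance, here is a hypothetical sketch of basic HDFS operations from Python, assuming the third-party HdfsCLI package (`pip install hdfs`) and a WebHDFS endpoint; the host, user and paths are placeholders. The same operations exist on the command line as `hdfs dfs -put`, `hdfs dfs -ls` and `hdfs dfs -cat`.

```python
from hdfs import InsecureClient

# WebHDFS endpoint of a (hypothetical) NameNode; 9870 is the default
# HTTP port in Hadoop 3.
client = InsecureClient("http://namenode:9870", user="student")

client.upload("/data/taxi.csv", "taxi.csv")  # local file -> HDFS
print(client.list("/data"))                  # list an HDFS directory

# Stream the file back without loading it fully into memory
with client.read("/data/taxi.csv") as reader:
    header = reader.read(1024)
```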

2.3 Hadoop MapReduce

  • Key Concepts (illustrated below)
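
The paradigm itself fits in a few lines of plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Hadoop runs these same phases, but distributed and fault-tolerant across a cluster. The classic word-count example:

```python
from collections import defaultdict

documents = ["big data is big", "data at rest"]

# Map: each input record yields (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values of each key into one result
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'rest': 1}
```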

2.4 Spark

  • Key Concepts
  • Importing/Exporting data
  • How to run Spark? (see the sketch after this list)
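
A minimal, hypothetical PySpark sketch touching all three bullets; the S3 path and the `amount` column are placeholders.

```python
from pyspark.sql import SparkSession

# Entry point: a SparkSession, whether Spark runs locally or on a cluster
spark = SparkSession.builder.appName("intro-big-data").getOrCreate()

# Import: Spark reads the CSV in parallel and infers a schema
df = spark.read.csv("s3a://my-bucket/taxi.csv", header=True, inferSchema=True)

# Transformations are lazy: nothing executes until an action (count, write, ...)
big_trips = df.filter(df["amount"] > 50)
print(big_trips.count())

# Export: write the result as Parquet, a columnar format Spark reads efficiently
big_trips.write.mode("overwrite").parquet("s3a://my-bucket/big-trips/")

spark.stop()
```

Installed locally with `pip install pyspark`, such a script runs in local mode via `spark-submit script.py`; on a cluster, the same code is submitted to a resource manager such as YARN or Kubernetes.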

3 Cloud Computing

  • Traditional IT, Virtualization, Containerization
  • Why cloud computing?
  • Categories of cloud services

Labs

  • Lab 0: Discover Amazon Web Services (AWS)
  • Lab 1: First steps with Spark
  • Lab 2: Spark ML
  • Lab 3: Stream processing with Spark