Introduction to Big Data

Lessons and Labs created by Rémi Pépin and Arthur Katossky

Objectives

  • Understand the basics of computation in the real world, its bottlenecks, and how to work around them
  • Understand the basics of cloud computing and how to use AWS (or SSP Cloud)
  • Get familiar with big data technologies and the most common paradigms
  • Learn how to use Spark to explore data at rest or streaming data, and how to run some basic ML algorithms on big data

Organisation

  • Lessons: 7h30
  • Labs: 10h30 + 3h (graded lab)
  • Presentations: 3h

Lessons

1.1 What is Big Data?

  • What is considered big?
  • Volume, Velocity, Variety, Veracity, Value, Variability

1.2 Computer science survival kit

  • Processors, Memory, Storage, Network

1.3 High-performance computing without distribution

  • Profile code, Analyse code
  • Store or process data in chunks, Take advantage of sparsity (see the sketch after this list)
  • Go low-level
  • Use cache
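
To make a couple of these concrete, here is a minimal Python sketch of chunked processing (with pandas) and of exploiting sparsity (with scipy). It is illustrative only: the file `big_file.csv` and its `amount` column are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Chunked processing: chunksize makes read_csv yield DataFrames of
# 100,000 rows at a time instead of loading the whole file in memory.
total = 0.0
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print("sum of 'amount':", total)

# Sparsity: a mostly-zero matrix stored densely wastes memory;
# a CSR sparse matrix stores only the non-zero entries.
dense = np.zeros((1_000, 1_000))
dense[0, 0] = 1.0
compressed = sparse.csr_matrix(dense)
print(compressed.nnz, "non-zero value stored instead of", dense.size)
```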

1.4 What if?

  • What if data is too big to fit in memory?
  • What if your file is too big for your local file system?
  • What if data is too big to fit on disk?
  • What if computation takes ages?
  • What if computation/storage is too expensive?

1.5 Social issues

  • Ethical issues, Environmental issues, Political issues

2.1 How to store big data

  • File system, Database, Distribution
  • The CAP theorem

2.2 Hadoop file system (HDFS)

  • Hadoop Ecosystem, How to use HDFS (see the sketch below)
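
For instance, here is a hypothetical sketch of basic HDFS operations from Python, assuming the third-party HdfsCLI package (`pip install hdfs`) and a WebHDFS endpoint; the host, user and paths are placeholders. The same operations exist on the command line as `hdfs dfs -put`, `hdfs dfs -ls` and `hdfs dfs -cat`.

```python
from hdfs import InsecureClient

# WebHDFS endpoint of a (hypothetical) NameNode; 9870 is the default
# HTTP port in Hadoop 3.
client = InsecureClient("http://namenode:9870", user="student")

client.upload("/data/taxi.csv", "taxi.csv")  # local file -> HDFS
print(client.list("/data"))                  # list an HDFS directory

# Stream the file back without loading it fully into memory
with client.read("/data/taxi.csv") as reader:
    header = reader.read(1024)
```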

2.3 Hadoop MapReduce

  • Key Concepts (illustrated below)
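
The paradigm itself fits in a few lines of plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Hadoop runs these same phases, but distributed and fault-tolerant across a cluster. The classic word-count example:

```python
from collections import defaultdict

documents = ["big data is big", "data at rest"]

# Map: each input record yields (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values of each key into one result
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'rest': 1}
```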

2.4 Spark

  • Key Concepts
  • Importing/Exporting data
  • How to run Spark? (see the sketch after this list)
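
A minimal, hypothetical PySpark sketch touching all three bullets; the S3 path and the `amount` column are placeholders.

```python
from pyspark.sql import SparkSession

# Entry point: a SparkSession, whether Spark runs locally or on a cluster
spark = SparkSession.builder.appName("intro-big-data").getOrCreate()

# Import: Spark reads the CSV in parallel and infers a schema
df = spark.read.csv("s3a://my-bucket/taxi.csv", header=True, inferSchema=True)

# Transformations are lazy: nothing executes until an action (count, write, ...)
big_trips = df.filter(df["amount"] > 50)
print(big_trips.count())

# Export: write the result as Parquet, a columnar format Spark reads efficiently
big_trips.write.mode("overwrite").parquet("s3a://my-bucket/big-trips/")

spark.stop()
```

Installed locally with `pip install pyspark`, such a script runs in local mode via `spark-submit script.py`; on a cluster, the same code is submitted to a resource manager such as YARN or Kubernetes.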

3 Cloud Computing

  • Traditional IT, Virtualization, Containerization
  • Why cloud computing?
  • Categories of cloud services

Labs

  • Lab 0: Discover Amazon Web Services (AWS)
  • Lab 1: First steps with Spark
  • Lab 2: Spark ML
  • Lab 3: Stream processing with Spark