Introduction to Big Data
Lessons and Labs created by Rémi Pépin and Arthur Katossky
Objectives
- Understand the basics of real-world computation, its bottlenecks, and how to work around them
- Understand the basics of cloud computing and how to use AWS (or SSP Cloud)
- Get familiar with big data technologies and the most common paradigms
- Learn how to use Spark for data exploration on data at rest or streaming data, and how to run basic ML algorithms on big data
Organisation
- Lessons: 7h30
- Labs: 10h30 + 3h (graded lab)
- Presentations: 3h
Lessons
1.1 What is Big Data
- What is considered big?
- Volume, Velocity, Variety, Veracity, Value, Variability
1.2 Computer science survival kit
- Processors, Memory, Storage, Network
1.3 High-performance computing without distribution
- Profile code, Analyse code
- Store or process data in chunks, Take advantage of sparsity (see the sketch after this list)
- Go low-level
- Use cache
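As a concrete illustration of the "process data in chunks" idea, here is a minimal Python sketch that computes a column total without ever loading the whole file in memory. The file name `big.csv` and the column `amount` are hypothetical; it assumes pandas is installed.

```python
import pandas as pd

total = 0.0
# Read 1 million rows at a time instead of the whole file,
# so peak memory is bounded by the chunk size, not the file size.
for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()  # "amount" is a hypothetical column

print(total)
```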
1.4 What if?
- What if data is too big to fit in memory?
- What if your file is too big for your local file system?
- What if data is too big to fit on disk?
- What if computation takes ages?
- What if computation / storage is too expensive?
2.1 How to store big data
- File system, Database, Distribution
- The CAP theorem
2.2 Hadoop file system (HDFS)
- Hadoop Ecosystem, How to use HDFS
2.3 Hadoop MapReduce
- Key Concepts (see the toy sketch below)
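To make the key concepts concrete, here is a toy, single-machine Python sketch of the MapReduce paradigm (word count). This is not the Hadoop API, just the map → shuffle → reduce logic that Hadoop distributes across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one (word, 1) pair per word.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key.
    return key, sum(values)

documents = ["big data is big", "data at rest"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'rest': 1}
```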
2.4 Spark
- Key Concepts
- Importing/Exporting data (see the sketch after this list)
- How to run Spark?
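Below is a minimal PySpark sketch of the import/export workflow: read a CSV, run one exploration step, write the result as Parquet. The paths and the `city` column are hypothetical; it assumes `pyspark` is installed and a local session is enough.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; locally this runs in-process.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Import: read a CSV file into a DataFrame, inferring column types.
df = spark.read.csv("data/trips.csv", header=True, inferSchema=True)

# One exploration step: count rows per value of a (hypothetical) column.
by_city = df.groupBy("city").count()
by_city.show()

# Export: write the result out in the Parquet format.
by_city.write.mode("overwrite").parquet("out/trips_by_city")

spark.stop()
```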
3 Cloud Computing
- Traditional IT, Virtualization, Containerization
- Why cloud computing?
- Categories of cloud services
1.5 Social issues