Create a Spark cluster on AWS
Important
We strongly recommend using the Google Chrome browser!
1 Chrome Setup
Caution
Only the first lab
2 Run the AWS Learner Lab
Caution
To be done for each lab
- Log in to AWS Academy
- Use your ENSAI email address
 
- Go to the Dashboard
- Select course AWS Academy Learner Lab [78065]
- Click on Learner Lab
- Click on Start Lab ▶️
- Wait about 2 minutes, until the circle next to AWS turns green
- Click on AWS
 
You are now inside the lab.
3 Create a Studio
Caution
Only the first lab
- In the search bar, search for the EMR service
- Click on Studios, then Create Studio
- Setup options ➡️ select Custom
- Studio settings
- Service role to let Studio access your AWS resources : LabRole
 
- Networking and security
- VPC : select the default VPC
- Subnets : choose one and note its name for later
- Click on Create Studio
- If an encryption key is required, uncheck the Encrypt Workspace files with your own AWS KMS key box
 
 
4 Create a Cluster
Caution
Only the first lab
- Click on Clusters, then Create cluster
- Go to the Cluster Configuration section
- For Primary, Core and Task 1 of 1, choose EC2 instance type : m4.large
 
- Networking section
- Use the same subnet as before
 
- Bootstrap actions
- Add
- Script location : s3://ensai-labs-2023-2024-files/install_packages.sh
- This script contains the instruction that installs numpy and pandas : sudo pip3 install numpy pandas
 
 
- Security configuration and EC2 key pair section
- Amazon EC2 key pair for SSH to the cluster : vockey
 
- Identity and Access Management (IAM) roles section
- Service role : EMR_DefaultRole
- Instance profile : EMR_EC2_DefaultRole
 
- Create Cluster
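For reference, the console choices listed in this section can be summarised as a request for boto3's EMR `run_job_flow` call. This is only a sketch, not the way the lab creates the cluster: the function name `build_cluster_request`, the cluster name, the release label, the instance count of 1 per group and the application list are assumptions that do not appear in this guide. The instance type, key pair, roles and bootstrap script are the ones given above.

```python
# Sketch only: mirrors the console settings of this section as a plain dict.
# Hypothetical values (not from this guide): name, release_label,
# InstanceCount, Applications.
def build_cluster_request(subnet_id, release_label="emr-6.15.0",
                          name="spark-lab-cluster"):
    """Return keyword arguments suitable for boto3's EMR `run_job_flow`."""

    def group(role):
        # Primary (MASTER), Core and Task groups all use m4.large.
        return {
            "Name": role.title(),
            "InstanceRole": role,
            "InstanceType": "m4.large",
            "InstanceCount": 1,  # assumption: one instance per group
        }

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],  # assumption
        "Instances": {
            "InstanceGroups": [group("MASTER"), group("CORE"), group("TASK")],
            "Ec2KeyName": "vockey",
            "Ec2SubnetId": subnet_id,  # the subnet noted when creating the Studio
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "install numpy and pandas",
                "ScriptBootstrapAction": {
                    "Path": "s3://ensai-labs-2023-2024-files/install_packages.sh"
                },
            }
        ],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }
```

If your lab credentials allow it, the dict could be passed as `boto3.client("emr").run_job_flow(**build_cluster_request("<your-subnet-id>"))`. Clusters used from a Studio workspace may need more applications than Spark alone, so treat the console procedure as the reference.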
Cluster creation takes about 10 minutes. In the meantime, you can read ahead in the lab subject.
Go back to EMR > Cluster and wait until the cluster status is Waiting (refresh 🔄)
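Instead of refreshing the page by hand, the wait could also be scripted. The sketch below polls the EMR `describe_cluster` call until the state is WAITING. The function name `wait_until_waiting`, the polling interval and the maximum number of polls are assumptions, and the cluster id is a placeholder to replace with your own.

```python
import time


def wait_until_waiting(emr_client, cluster_id, poll_seconds=30, max_polls=60):
    """Poll the cluster state and return the last state seen.

    Stops early when the cluster is WAITING (ready) or when it has entered
    a terminating or terminated state (creation failed or was cancelled).
    """
    stopped = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}
    state = None
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state == "WAITING" or state in stopped:
            return state
        time.sleep(poll_seconds)
    return state


# Usage, assuming boto3 is installed and your lab credentials are configured:
#   import boto3
#   emr = boto3.client("emr")
#   print(wait_until_waiting(emr, "<your-cluster-id>"))
```

The client is passed in as an argument so that the function can be tried with a stand-in object before touching the real account.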
4.1 Clone the Cluster
Caution
Starting from the second lab
- Go to EMR service and then Clusters
- Select the old cluster and click on Clone
- Before confirming the clone, check the bootstrap action (⚠️ needed if it was not set in section “Create a Cluster”)
- Go to section Bootstrap actions
- Add
- Script location : s3://ensai-labs-2023-2024-files/install_packages.sh
- This script contains the instructions that install numpy and pandas on all nodes
- Workaround : alternatively, you can :
- create your own sh file,
- paste this code into it : sudo pip3 install numpy pandas
- store the file in your own bucket
- use it as Script location : s3:///<your_file.sh>
 
 
 
- Cluster creation takes about 10 minutes. In the meantime, you can read ahead in the lab subject.
- Go back to EMR > Cluster and wait until the cluster status is Waiting (refresh 🔄)
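For the workaround mentioned in this section (your own bootstrap script stored in your own bucket), the file could be produced as in the sketch below. The `pip3` command is the one quoted in this guide. The shebang line, the constant `BOOTSTRAP_SCRIPT`, the function name `write_bootstrap_script` and the default file name are assumptions.

```python
from pathlib import Path

# The install command comes from this guide; the shebang line is an assumption.
BOOTSTRAP_SCRIPT = "#!/bin/bash\nsudo pip3 install numpy pandas\n"


def write_bootstrap_script(path="install_packages.sh"):
    """Write the bootstrap script to `path` and return the path."""
    target = Path(path)
    target.write_text(BOOTSTRAP_SCRIPT)
    return target


# Usage, assuming boto3 is installed and you own a bucket (placeholder name):
#   import boto3
#   script = write_bootstrap_script()
#   boto3.client("s3").upload_file(str(script), "<your-bucket>", script.name)
```

The file can also be uploaded by hand from the S3 console. Either way, its S3 location is what goes into the Script location field.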
5 Associate the cluster with a workspace
Caution
To be done for each lab
- Click on Workspaces (Notebooks)
- Select the workspace
- Click on Actions > Stop
- Refresh until the status turns to Idle
 
- Select the workspace, then click on Attach cluster
- Check Launch in Jupyter
- EMR cluster : your cluster
- Attach cluster and launch
 
A new tab will open. Save the URL, for example by bookmarking it.
Now you can load the lab notebook (lab_.ipynb file) and open it.
Once it is open, check at the top right that the kernel is set to PySpark. If not, select Kernel > Change Kernel > PySpark.
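A quick first cell can confirm that the bootstrap action did its job. The sketch below reports whether numpy and pandas can be found by the Python interpreter that runs the cell. The function name `check_packages` is an assumption, and the check covers only the machine where the cell runs, not every node of the cluster.

```python
import importlib.util


def check_packages(names=("numpy", "pandas")):
    """Return a dict mapping each top-level package name to True if importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Each package should map to True if the bootstrap script ran on this machine.
print(check_packages())
```

If a package shows False, check that the bootstrap action was set when the cluster was created or cloned.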
Summary
| | First Lab | Other Labs |
|---|---|---|
| Learner Lab | launch | launch |
| Studio | create | - |
| Cluster | create | clone |
| Workspace | attach cluster | attach cluster |
| End of the lab | terminate cluster ❗ | terminate cluster ❗ |
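Forgetting to terminate the cluster at the end of a lab consumes lab credit, so a scripted safety net can help. The sketch below lists clusters that are still active and asks EMR to terminate them, using the boto3 calls `list_clusters` and `terminate_job_flows`. The function name `terminate_active_clusters` is an assumption, and the sketch reads only the first page of results, which should be enough for a lab account with a handful of clusters.

```python
# Cluster states that still consume resources.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def terminate_active_clusters(emr_client):
    """Terminate every active cluster and return the list of their ids."""
    response = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)
    cluster_ids = [cluster["Id"] for cluster in response["Clusters"]]
    if cluster_ids:
        emr_client.terminate_job_flows(JobFlowIds=cluster_ids)
    return cluster_ids


# Usage, assuming boto3 is installed and your lab credentials are configured:
#   import boto3
#   print(terminate_active_clusters(boto3.client("emr")))
```

Terminating from the EMR console remains the reference procedure. Afterwards, the cluster list should show no cluster in an active state.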