Create a Spark cluster on AWS
Important
We strongly recommend using the Google Chrome browser!
1 Chrome Setup
Caution
Only the first lab
2 Run the AWS Learner Lab
Caution
To be done for each lab
- Log in AWS academy
- use your ENSAI mail
- Go to the
Dashboard
- Select course AWS Academy Learner Lab [78065]
- Click on
Learner Lab
- Click on Start Lab ▶️
- Wait for 2 minutes until the circle next to AWS turns green
- Click on
AWS
Now, you enter the lab.
3 Create a Studio
Caution
Only the first lab
- In the search bar, search for the EMR service
- Click on
Studios
, then Create Studio - Setup options ➡️ select Custom
- Studio settings
- Service role to let Studio access your AWS resources :
LabRole
- Service role to let Studio access your AWS resources :
- Networking and security
- VPC : select the default VPC
- Subnets : choose one and note its name for later
- Click on
Create Studio
- If Encryption key is required, then uncheck the Encrypt Workspace files with your own AWS KMS key box
4 Create a Cluster
Caution
Only the first lab
- Click on
Clusters
, thenCreate cluster
- Go to the Cluster Configuration section
- For Primary, Core and Task 1 of 1, choose EC2 instance type :
m4.large
- For Primary, Core and Task 1 of 1, choose EC2 instance type :
- Networking section
- Use the same subnet as before
- Bootstrap actions
- Add
- Script location :
s3://ensai-labs-2023-2024-files/install_packages.sh
- this script contain instructions to install numpy and pandas : sudo pip3 install numpy pandas
- Security configuration and EC2 key pair section
- Amazon EC2 key pair for SSH to the cluster : vockey
- Identity and Access Management (IAM) roles section
- Service role : EMR_DefaultRole
- Instance profile : EMR_EC2_DefaultRole
- Create Cluster
The cluster creation takes some time (10 min). You can advance the reading of the subject.
Go back to EMR > Cluster and wait until the cluster status is Waiting (refresh 🔄)
4.1 Clone the Cluster
Caution
Starting from the second lab
- Go to EMR service and then Clusters
- Select the old cluster and click on Clone
- Before validate clone creation (⚠️ TODO because i forgot it in section “Create a cluster”)
- Go to section Bootstrap actions
- Add
- Script location :
s3://ensai-labs-2023-2024-files/install_packages.sh
- this script contain instructions to install numpy and pandas on all nodes
- workaround : if you want, feel free to :
- create your own sh file,
- paste this code : sudo pip3 install numpy pandas
- store the file in your own bucket
- Script location : *s3://
/<your_file.sh>
- The cluster creation takes some time (10 min). You can advance the reading of the subject.
- Go back to EMR > Cluster and wait until the cluster status is Waiting (refresh 🔄)
5 Associate the cluster with a workspace
Caution
To be done for each lab
- Click on
Workspaces (Notebooks)
- Select the workspace
- click on Actions > Stop
- Refresh until the status turn to Idle
- Select the workspace, then click on
Attach cluster
- Check Launch in Jupyter
- EMR cluster : your cluster
- Attach cluster and launch
A new tab will open. Save the URL, for example by bookmarking it.
Now you can load the lab notebooks (lab_.ipynb file) and then open it.
Once it’s open, verify that the kernel is set to PySpark on the top right. If not : Kernel > Change Kernel > PySpark
Summary
First Lab | Other Labs | |
---|---|---|
Learner Lab | launch | launch |
Studio | create | - |
Cluster | create | clone |
Workspace | attach cluster | attach cluster |
End of the lab | terminate cluster ❗ | terminate cluster ❗ |