Create a Spark cluster on AWS


Ludovic Deneuville


We strongly recommend using the Google Chrome browser!

1 Chrome Setup


Only the first lab

2 Run the AWS Learner Lab


To be done for each lab

  • Log in to AWS Academy
    • use your ENSAI email address
  • Go to the Dashboard
    • Select course AWS Academy Learner Lab [78065]
    • Click on Learner Lab
    • Click on Start Lab ▶️
    • Wait for 2 minutes until the circle next to AWS turns green
    • Click on AWS

You are now inside the lab.

3 Create a Studio


Only the first lab

  • In the search bar, search for the EMR service
  • Click on Studios, then Create Studio
  • Setup options ➡️ select Custom
  • Studio settings
    • Service role to let Studio access your AWS resources : LabRole
  • Networking and security
    • VPC : select the default VPC
    • Subnets : choose one and note its name for later
    • Click on Create Studio
      • If Encryption key is required, then uncheck the Encrypt Workspace files with your own AWS KMS key box
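If you would rather look up the subnets of the default VPC from code than from the console, the following is a minimal sketch. It assumes boto3 is installed and that the Learner Lab credentials are configured on your machine; neither is part of this lab. The function name `default_vpc_subnets` is hypothetical.

```python
def default_vpc_subnets(ec2_client):
    """Return sorted (subnet_id, availability_zone) pairs for the default VPC."""
    # Ask EC2 for the VPC flagged as default in the current region.
    vpcs = ec2_client.describe_vpcs(
        Filters=[{"Name": "is-default", "Values": ["true"]}]
    )["Vpcs"]
    if not vpcs:
        return []
    vpc_id = vpcs[0]["VpcId"]
    # List every subnet that belongs to that VPC.
    subnets = ec2_client.describe_subnets(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["Subnets"]
    return sorted((s["SubnetId"], s["AvailabilityZone"]) for s in subnets)


# Possible usage (assumption: boto3 and credentials are available):
# import boto3
# for subnet_id, zone in default_vpc_subnets(boto3.client("ec2")):
#     print(subnet_id, zone)
```

Whichever subnet you pick, keep its identifier: the cluster has to use the same one.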

4 Create a Cluster


Only the first lab

  • Click on Clusters, then Create cluster
  • Go to the Cluster Configuration section
    • For Primary, Core and Task 1 of 1, choose EC2 instance type : m4.large
  • Networking section
    • Use the same subnet as before
  • Bootstrap actions
    • Add
    • Script location : s3://ensai-labs-2023-2024-files/
      • this script contains the command that installs numpy and pandas : sudo pip3 install numpy pandas
  • Security configuration and EC2 key pair section
    • Amazon EC2 key pair for SSH to the cluster : vockey
  • Identity and Access Management (IAM) roles section
    • Service role : EMR_DefaultRole
    • Instance profile : EMR_EC2_DefaultRole
  • Create Cluster
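For reference, the console settings above map roughly onto the arguments of boto3's `run_job_flow` call. This is only a sketch, not the procedure used in the lab. The helper name `build_cluster_request` is hypothetical, and the release label, the list of applications and the instance count of 1 per group are assumptions, since this page does not state them.

```python
def build_cluster_request(name, release_label, subnet_id, bootstrap_script,
                          applications=("Spark",)):
    """Build the keyword arguments for boto3's EMR run_job_flow call."""
    # One instance group per node type; the console's "Primary" is role MASTER.
    instance_groups = [
        {
            "Name": label,
            "InstanceRole": role,
            "InstanceType": "m4.large",
            "InstanceCount": 1,  # assumption: the page gives no instance counts
        }
        for label, role in (("Primary", "MASTER"), ("Core", "CORE"), ("Task", "TASK"))
    ]
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # assumption: not specified on this page
        "Applications": [{"Name": app} for app in applications],  # assumption
        "Instances": {
            "InstanceGroups": instance_groups,
            "Ec2KeyName": "vockey",
            "Ec2SubnetId": subnet_id,  # same subnet as the Studio
            "KeepJobFlowAliveWhenNoSteps": True,  # stay up so it can reach Waiting
        },
        "BootstrapActions": [
            {
                "Name": "install numpy and pandas",
                "ScriptBootstrapAction": {"Path": bootstrap_script},
            }
        ],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",  # the console's "Instance profile"
    }


# Possible usage (assumption: boto3 and credentials are available):
# import boto3
# request = build_cluster_request("lab-cluster", "<release label>",
#                                 "<subnet id>", "<S3 path of the script>")
# cluster_id = boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```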

Creating the cluster takes about 10 minutes. In the meantime, you can read ahead in the lab instructions.

Go back to EMR > Clusters and wait until the cluster status is Waiting (refresh 🔄)
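Instead of refreshing the page by hand, the wait can be scripted as a polling loop. The sketch below is illustrative only: `wait_until_waiting` is a hypothetical helper, and the polling interval and timeout are arbitrary assumptions. It takes a function that returns the current cluster state, so it can be tried without any AWS access.

```python
import time

# States from which the cluster will never reach WAITING.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_waiting(get_state, poll_seconds=30, timeout_seconds=1800,
                       sleep=time.sleep):
    """Poll get_state() until it returns "WAITING"; raise on failure or timeout."""
    waited = 0
    while True:
        state = get_state()
        if state == "WAITING":
            return state
        if state in TERMINAL_STATES:
            raise RuntimeError(f"cluster ended in state {state}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Possible usage (assumption: boto3, credentials and a known cluster id):
# import boto3
# emr = boto3.client("emr")
# wait_until_waiting(
#     lambda: emr.describe_cluster(ClusterId="<cluster id>")["Cluster"]["Status"]["State"]
# )
```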

4.1 Clone the Cluster


Starting from the second lab

  • Go to EMR service and then Clusters
  • Select the old cluster and click on Clone
    • Before validating the clone, check that the bootstrap action is present, and add it if it is missing :
    • Go to section Bootstrap actions
    • Add
    • Script location : s3://ensai-labs-2023-2024-files/
      • this script contains instructions to install numpy and pandas on all nodes
      • workaround : if you prefer, you can :
        • create your own sh file,
        • paste this code : sudo pip3 install numpy pandas
        • store the file in your own bucket
        • Script location : s3:///<> (the S3 path of your own file)
  • Creating the cluster takes about 10 minutes. In the meantime, you can read ahead in the lab instructions.
  • Go back to EMR > Clusters and wait until the cluster status is Waiting (refresh 🔄)
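If you take the workaround route and write your own bootstrap script, the file can be generated as sketched below. The helper name `write_bootstrap_script` and the file name are hypothetical, and the `#!/bin/bash` first line is an assumption added here; the only command taken from this page is `sudo pip3 install numpy pandas`.

```python
from pathlib import Path

# Shebang line (assumption) followed by the install command from the lab.
BOOTSTRAP_COMMANDS = "#!/bin/bash\nsudo pip3 install numpy pandas\n"


def write_bootstrap_script(path):
    """Write the bootstrap script to a local file and return its Path."""
    script = Path(path)
    script.write_text(BOOTSTRAP_COMMANDS)
    return script


# Possible usage (assumption: boto3, credentials and a bucket you own):
# import boto3
# script = write_bootstrap_script("install_packages.sh")
# boto3.client("s3").upload_file(str(script), "<your bucket>", script.name)
```

The file can also be uploaded through the S3 console; the bootstrap action only needs its S3 path.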

5 Associate the cluster with a workspace


To be done for each lab

  • Click on Workspaces (Notebooks)
  • Select the workspace
    • click on Actions > Stop
    • Refresh until the status turns to Idle
  • Select the workspace, then click on Attach cluster
    • Check Launch in Jupyter
    • EMR cluster : your cluster
    • Attach cluster and launch

A new tab will open. Save the URL, for example by bookmarking it.

You can now upload the lab notebook (lab_.ipynb file) and open it.

Once it’s open, check in the top right that the kernel is set to PySpark. If not : Kernel > Change Kernel > PySpark
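Once the kernel is set, a quick first cell can confirm that the packages installed by the bootstrap action are importable. This is a minimal sketch: `missing_packages` is a hypothetical helper, and it only checks the machine that executes the cell, not every node of the cluster.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# numpy and pandas are the two packages the bootstrap script installs.
missing = missing_packages(["numpy", "pandas"])
print("Missing packages:", missing or "none")
```

If a package is reported missing, the bootstrap action was probably not applied when the cluster was created or cloned.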


                 | First Lab         | Other Labs
-----------------|-------------------|------------------
Learner Lab      | launch            | launch
Studio           | create            | -
Cluster          | create            | clone
Workspace        | attach cluster    | attach cluster
End of the lab   | terminate cluster | terminate cluster
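The last row of the table, terminating the cluster at the end of each lab, can also be done from code. The following is a sketch under assumptions: boto3 and credentials are available, `terminate_active_clusters` is a hypothetical helper, and only the first page of `list_clusters` results is read, which should be enough for a lab account with a handful of clusters.

```python
# Cluster states that still consume lab credits.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def terminate_active_clusters(emr_client):
    """Terminate every active cluster and return the list of their ids."""
    clusters = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)["Clusters"]
    cluster_ids = [cluster["Id"] for cluster in clusters]
    if cluster_ids:
        emr_client.terminate_job_flows(JobFlowIds=cluster_ids)
    return cluster_ids


# Possible usage (assumption: boto3 and credentials are available):
# import boto3
# print(terminate_active_clusters(boto3.client("emr")))
```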