Apache Spark - Running SparkSQL on an Amazon EMR Cluster
Introduction
This blog shows how to run a Spark application on an Amazon EMR (Elastic MapReduce) cluster. We will run the Spark application on top of a Hadoop cluster and store the input data in Amazon S3.
Amazon EMR
Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances.
We can also run other popular distributed frameworks such as Apache Spark and HBase in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
Amazon S3 (Simple Storage Service)
S3 is a distributed storage system and AWS's equivalent to HDFS. The reason for storing our data in a cloud platform instead of on a local computer is straightforward. We want to make sure that:
- Our data comes from a distributed file system that can be accessed by every node in our Spark cluster.
- Our Spark application does not assume that the input data sits on a local disk, because that would not scale.
By saving the input data in S3, every Spark node deployed on the EMR cluster can read the input data directly from S3.
Experiment
Note: Since the reference tutorial is from 2020, the AWS web console has changed a lot as of 2024, but the key process remains the same.
Create EMR Cluster
Log into your AWS account, search for EMR, and create an EMR cluster.
- Use the Spark application bundle.
- Choose the instance types for the Master, Core, and Task nodes.
- Select an EC2 key pair. It will be used when connecting over SSH and is a necessary configuration for connecting to the cluster.
Then we can create the cluster. It takes about 10-15 minutes to provision. (An equivalent AWS CLI sketch is shown below.)
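For reference, roughly the same setup can be done with the AWS CLI. This is only a minimal sketch; the cluster name, release label, instance type, instance count, and key-pair name are placeholders to replace with your own values.

# Minimal sketch: create an EMR cluster with Spark installed (placeholder values)
aws emr create-cluster \
  --name "spark-sql-demo" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles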
Update security settings to enable SSH connections
- Open the security groups for the master node.
- Click the master security group and hit Edit under the Inbound rules.
- Add a rule with type SSH and source Anywhere. Note: Anywhere allows any IP address to connect, which should never be used in a production environment. Usually we set an allowed IP range for the connection instead, as in the CLI sketch below.
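If you prefer to restrict SSH to a single IP instead of Anywhere, the same inbound rule can be added from the AWS CLI. The security group ID and CIDR below are placeholders.

# Sketch: allow SSH (port 22) only from one IP address (placeholder values)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.10/32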
Upload the Dataset and Script
- Create a new S3 bucket.
- Upload the CSV file to the newly created S3 bucket. (Both steps can also be done from the CLI, as sketched below.)
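A minimal CLI equivalent, assuming placeholder bucket and prefix names:

# Sketch: create the bucket and upload the survey CSV (placeholder names)
aws s3 mb s3://{bucket_name}
aws s3 cp 2016-stack-overflow-survey-responses.csv s3://{bucket_name}/{file_name}/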
Update and Run the Spark Application Script
- Modify the local application script.
from pyspark.sql import SparkSession

AGE_MIDPOINT = "age_midpoint"
SALARY_MIDPOINT = "salary_midpoint"
SALARY_MIDPOINT_BUCKET = "salary_midpoint_bucket"

if __name__ == "__main__":
    # Remove .master("local[1]") to enable running on multiple machines.
    # session = SparkSession.builder.appName("StackOverFlowSurvey").master("local[1]").getOrCreate()
    session = SparkSession.builder.appName("StackOverFlowSurvey").getOrCreate()

    # Set the log level to ERROR to avoid overly verbose INFO output.
    sc = session.sparkContext
    sc.setLogLevel('ERROR')

    dataFrameReader = session.read

    # Spark now loads the file from S3 instead of the local path used before:
    # .csv("in/2016-stack-overflow-survey-responses.csv")
    responses = dataFrameReader \
        .option("header", "true") \
        .option("inferSchema", value=True) \
        .csv("s3n://{bucket_name}/{file_name}/2016-stack-overflow-survey-responses.csv")

    print("=== Print out schema ===")
    responses.printSchema()
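The excerpt stops after printing the schema. Purely as an illustration of how the SALARY_MIDPOINT_BUCKET constant could be used (this is a sketch, not necessarily the original tutorial's code), the loaded DataFrame might be bucketed and aggregated like this:

# Illustrative sketch only: group salary midpoints into $20,000-wide buckets
from pyspark.sql.functions import col

responses_with_buckets = responses.withColumn(
    SALARY_MIDPOINT_BUCKET,
    (col(SALARY_MIDPOINT) / 20000).cast("integer") * 20000)

responses_with_buckets.groupBy(SALARY_MIDPOINT_BUCKET) \
    .count() \
    .orderBy(SALARY_MIDPOINT_BUCKET) \
    .show()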
- Then upload the Python script to the S3 bucket, for example with the command below.
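Assuming the script is saved locally as spark_survey.py (a placeholder name), the upload is a single command:

# Sketch: upload the Spark script to the bucket (placeholder names)
aws s3 cp spark_survey.py s3://{bucket_name}/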
Connect and run the Spark application using SSH
- Find the Master public DNS of the EMR cluster we created.
- Connect to the master machine using PuTTY or plain ssh, adding the key pair to the connection (a consolidated example session is sketched after this list).
- After connecting successfully, use the terminal to fetch the Python script from S3 to the master machine for execution:
aws s3 cp s3://{bucket_name}/{file_name}.py .
- Then use spark-submit to run the application:
spark-submit {python_script_name}.py
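Putting the steps together, a typical terminal session might look like the sketch below. The key file, public DNS, bucket, and script names are placeholders; hadoop is the default login user on EMR master nodes.

# Sketch of the full session (placeholder names throughout)
ssh -i my-key-pair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# On the master node: fetch the script from S3 and submit it to Spark
aws s3 cp s3://{bucket_name}/{file_name}.py .
spark-submit {file_name}.py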
Now we know how to run a Spark application on a remote EMR cluster!
Reference:
Thanks to the amazing tutorial by YouTuber Analytics Excellence.
The code can be found in the GitHub repository.