Apache Spark - SparkSQL Introduction

Introduction

Spark SQL is an interface on Spark for working with structured and semi-structured data.

Structured data

Structured data is any data that has a schema, that is, a known set of fields for each record. Spark SQL provides a Dataset abstraction that simplifies working with structured datasets; a Dataset is similar to a table in a relational database. More and more Spark workflows are moving towards Spark SQL. The whole point of Spark SQL and its related technologies is dealing with structured data.

A dataset that has a natural schema lets Spark store the data in a more efficient manner, and Spark can run queries on it using actual SQL commands.

Important concepts: DataFrame and Dataset

DataFrame:

  • Spark SQL introduced a tabular data abstraction called DataFrame in v1.3. A DataFrame is a data abstraction and a domain-specific language for working with structured and semi-structured data. It can store data in a more efficient manner than native RDDs by taking advantage of its schema.

  • It uses the immutable, in-memory, resilient, distributed and parallel capabilities of RDDs, and applies a structure called a schema to the data, allowing Spark to manage the schema and only pass the data between nodes, in a much more efficient way than using Java serialization.

  • Unlike in an RDD, the data is organized into named columns, like a table in a relational database. DataFrames also provide new operations not available on RDDs, such as the ability to run SQL queries.

Dataset:

  • The Dataset API was released in Spark 1.6. It provides the familiar object-oriented programming style and compile-time type safety of the RDD API, along with the benefits of leveraging a schema to work with structured data.

  • A Dataset is a collection of structured data; its records are not necessarily Rows but can be objects of a particular type.

  • Java and Spark will know the type of the data in a Dataset at compile time.

  • Because it depends on compile-time type information, the Dataset API is not available in Python.

DataFrame and Dataset

Since Spark 2.0, the DataFrame API has been merged with the Dataset API.

  • Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API.

  • Consider a DataFrame as the untyped view of a Dataset: it is a Dataset of Row, where a Row is a generic untyped JVM object.

  • A Dataset, by contrast, is a collection of strongly-typed JVM objects.

  • The Dataset API is only available in Java and Scala.

  • For Python, we stick with the DataFrame API.

Reference:

Thanks to the YouTuber Analytics Excellence for the amazing tutorial.

The code can be found in the GitHub repository.