Apache Spark - SparkSQL Introduction

Introduction

Spark SQL is an interface on Spark for working with structured and semi-structured data.

Structured data

Structured data is any data that has a schema, that is, a known set of fields for each record. Spark SQL provides a Dataset abstraction that simplifies working with structured datasets; a Dataset is similar to a table in a relational database. More and more Spark workflows are moving towards Spark SQL. The whole point of Spark SQL and its related technologies is dealing with structured data.

A dataset that has a natural schema lets Spark store the data in a more efficient manner, and Spark can run queries on it using actual SQL commands.

Important concepts: DataFrame and Dataset

DataFrame:

  • Spark SQL introduced a tabular data abstraction called DataFrame in v1.3. A DataFrame is a data abstraction and a domain-specific language for working with structured and semi-structured data. It can store data in a more efficient manner than native RDDs by taking advantage of its schema.

  • It uses the immutable, in-memory, resilient, distributed and parallel capabilities of RDDs, and applies a structure called a schema to the data, allowing Spark to manage the schema and only pass the data between nodes, in a much more efficient way than using Java serialization.

  • Unlike in an RDD, the data is organized into named columns, like a table in a relational database. DataFrames also provide new operations not available on RDDs, such as the ability to run SQL queries.

Dataset:

  • The Dataset API was released in Spark 1.6. It provides the familiar object-oriented programming style and compile-time type safety of the RDD API, along with the benefits of leveraging a schema to work with structured data.

  • A Dataset is a collection of structured data; its records are not necessarily Rows but can be objects of a particular type.

  • Java and Spark will know the type of the data in a Dataset at compile time.

  • Because it depends on compile-time type information, the Dataset API is not available in Python.

DataFrame and Dataset

Since Spark 2.0, the DataFrame API has been merged with the Dataset API.

  • Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API.

  • Consider a DataFrame as the untyped view of a Dataset: it is a Dataset of Row, where a Row is a generic untyped JVM object.

  • A Dataset, by contrast, is a collection of strongly-typed JVM objects.

  • The Dataset API is only available in Java and Scala.

  • For Python, we stick with the DataFrame API.

Reference:

Thanks to the YouTuber Analytics Excellence for the amazing tutorial.

The code can be found in the GitHub repository.