Apache Spark - Broadcast Variables

Introduction Broadcast variables are variables we want to share throughout the cluster. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
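
The saving can be illustrated without a cluster. The sketch below is a plain-Python model, not Spark code: the `Worker` class, the `run` helper and the country lookup table are hypothetical names made up for the illustration. It counts how often the lookup table is shipped when it travels with every task versus when each worker caches it once. In Spark itself the corresponding calls are `SparkContext.broadcast(value)` on the driver and `.value` inside tasks.

```python
import pickle


class Worker:
    """Hypothetical stand-in for one machine in the cluster."""

    def __init__(self):
        self.cache = {}      # broadcast id -> deserialized value
        self.transfers = 0   # how many times the table was shipped here

    def run_task_with_copy(self, payload, record):
        # The lookup table travels with every single task.
        self.transfers += 1
        table = pickle.loads(payload)
        return table.get(record, "unknown")

    def run_task_with_broadcast(self, broadcast_id, fetch, record):
        # The table is fetched once, then served from the local cache.
        if broadcast_id not in self.cache:
            self.transfers += 1
            self.cache[broadcast_id] = pickle.loads(fetch())
        return self.cache[broadcast_id].get(record, "unknown")


country_names = {"us": "United States", "de": "Germany", "jp": "Japan"}
payload = pickle.dumps(country_names)
records = ["us", "jp", "de", "us", "fr", "jp"]


def run(records, workers, use_broadcast):
    """Run one task per record, assigning tasks to workers round-robin."""
    results = []
    for i, record in enumerate(records):
        worker = workers[i % len(workers)]
        if use_broadcast:
            results.append(
                worker.run_task_with_broadcast(0, lambda: payload, record))
        else:
            results.append(worker.run_task_with_copy(payload, record))
    return results, sum(w.transfers for w in workers)


copy_results, copy_transfers = run(records, [Worker(), Worker()], False)
bc_results, bc_transfers = run(records, [Worker(), Worker()], True)
print(copy_transfers, bc_transfers)  # 6 2
```

Both runs produce the same results; only the number of transfers differs (one per task versus one per worker), which is why broadcasting pays off for large read-only data.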

Apache Spark - Join Operation

Introduction The join operation combines two RDDs by key. It is probably one of the most common operations on a pair RDD. Joins come in several types: leftOuterJoin, rightOuterJoin, cross join, inner join, etc.
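
To make the difference between the join types concrete, here is a minimal plain-Python model of their semantics on lists of key-value pairs. It is a sketch only, not Spark code: the function names and the sample data are made up for the illustration. On a real pair RDD the matching methods are `join`, `leftOuterJoin` and `rightOuterJoin`, and PySpark likewise uses `None` for the missing side of an outer join.

```python
from collections import defaultdict
from itertools import product


def _group(pairs):
    """Collect the values of each key into a list, keeping first-seen order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def inner_join(left, right):
    # Keep only keys present on both sides.
    l, r = _group(left), _group(right)
    return [(k, (a, b)) for k in l if k in r
            for a, b in product(l[k], r[k])]


def left_outer_join(left, right):
    # Keep every left key; fill a missing right value with None.
    l, r = _group(left), _group(right)
    return [(k, (a, b)) for k in l
            for a, b in product(l[k], r.get(k, [None]))]


def right_outer_join(left, right):
    # Keep every right key; fill a missing left value with None.
    l, r = _group(left), _group(right)
    return [(k, (a, b)) for k in r
            for a, b in product(l.get(k, [None]), r[k])]


ages = [("alice", 30), ("bob", 25), ("carol", 41)]
cities = [("alice", "Paris"), ("bob", "Oslo"), ("dave", "Lima")]

inner = inner_join(ages, cities)
left = left_outer_join(ages, cities)
right = right_outer_join(ages, cities)

print(inner)  # [('alice', (30, 'Paris')), ('bob', (25, 'Oslo'))]
print(left)   # inner rows plus ('carol', (41, None))
print(right)  # inner rows plus ('dave', (None, 'Lima'))
```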

Apache Spark - Accumulators

Introduction

Apache Spark - Data-Partitioning

Introduction Data partitioning lets users control the layout of a pair RDD across nodes. Using a proper data partitioning technique may greatly reduce the communication cost between worker nodes by ensuring that data accessed together resides on the same node, which can significantly improve performance.
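
A small plain-Python model shows why this helps. It is a sketch under stated assumptions, not Spark code: `stable_hash`, `partition_by` and `local_join` are hypothetical helpers, and CRC32 merely stands in for whatever hash function a real partitioner uses. When two datasets are hash-partitioned the same way, all records for a given key land in the partition with the same index, so a join can run partition by partition with no data crossing between them. In Spark, a pair RDD is laid out this way with `partitionBy`.

```python
import zlib


def stable_hash(key):
    # CRC32 is an assumption for this sketch; it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(str(key).encode("utf-8"))


def partition_by(pairs, num_partitions):
    """Place each (key, value) in the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[stable_hash(key) % num_partitions].append((key, value))
    return partitions


def local_join(left_part, right_part):
    """Inner join of two partitions, using only data already in them."""
    lookup = {}
    for key, value in right_part:
        lookup.setdefault(key, []).append(value)
    return [(k, (a, b)) for k, a in left_part for b in lookup.get(k, [])]


users = [(1, "alice"), (2, "bob"), (3, "carol")]
events = [(1, "login"), (3, "click"), (1, "logout"), (2, "login")]

NUM_PARTITIONS = 3
user_parts = partition_by(users, NUM_PARTITIONS)
event_parts = partition_by(events, NUM_PARTITIONS)

# Because both sides use the same partitioner, partition i of `users`
# only ever needs partition i of `events`.
joined = [row
          for user_part, event_part in zip(user_parts, event_parts)
          for row in local_join(user_part, event_part)]
print(sorted(joined))
```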

Apache Spark - Pair RDDs

Introduction Pair RDDs are a common data type in many Spark operations. Many real-life datasets consist of key-value pairs. The typical pattern of this kind of dataset is that each row of data is a map from one key to one or multiple values. To make working with this kind of data simpler and more efficient, Spark provides a dedicated data structure called Pair RDD alongside regular RDDs. Simply put, a Pair RDD is a particular type of RDD that stores key-value pairs. It can be created from a regular key-value object or by converting a regular RDD.
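
The conversion from regular records to key-value pairs, and the kind of per-key work it enables, can be sketched in plain Python. This is a model, not Spark code: the sample lines and the two helper functions are made up for the illustration. On a real RDD the same steps would use `map` to build the pairs and then `reduceByKey` or `groupByKey` on the resulting Pair RDD.

```python
from collections import defaultdict

# Regular records: one text line per row (hypothetical sample data).
lines = ["apple 3", "banana 5", "apple 4", "cherry 1", "banana 2"]

# Step 1: turn each record into a (key, value) tuple.
pairs = [(name, int(qty)) for name, qty in (line.split() for line in lines)]


def reduce_by_key(pairs, func):
    """Combine all values that share a key using a binary function."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def group_by_key(pairs):
    """Collect all values that share a key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Step 2: key-aware operations become one-liners.
totals = reduce_by_key(pairs, lambda a, b: a + b)
groups = group_by_key(pairs)

print(totals)  # {'apple': 7, 'banana': 7, 'cherry': 1}
print(groups)  # {'apple': [3, 4], 'banana': [5, 2], 'cherry': [1]}
```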