Apache Spark Main Concepts

Zehra Nur Günindi
4 min read · Jul 7, 2022

Apache Spark is an open-source framework that is used for big data processing.

Apache Spark provides APIs in Java, Scala, Python, and R. It also includes five higher-level components: Spark SQL, pandas API on Spark, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

In this article, I would like to talk about cluster managers, RDD, and Spark Architecture.

Cluster Managers

Apache Spark supports several cluster managers, for example Standalone, Mesos, YARN, and Kubernetes. The cluster manager is what gives the application the resources it needs to run.

Source: IBM

The Spark cluster manager communicates with a cluster to acquire resources for an application. While an application is running, the Spark Context creates tasks and tells the cluster manager what resources are needed. The cluster manager then reserves executor cores and memory, and the tasks are transferred to the executors.

If you want to check how to decide the number of executors, executor memory, executor cores, and driver memory, you can watch this short and clear explanation.
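As a rough illustration, these values map to ordinary Spark configuration properties. Below is a minimal sketch of setting them in code; the application name and all of the numbers are placeholder assumptions, not recommendations (and note that some settings, like driver memory, normally have to be set before the driver JVM starts, e.g. via spark-submit):

from pyspark import SparkConf, SparkContext

# placeholder resource settings -- tune them for your own cluster
conf = (SparkConf()
        .setAppName("resource-demo")              # hypothetical app name
        .set("spark.executor.instances", "4")     # number of executors
        .set("spark.executor.cores", "2")         # cores per executor
        .set("spark.executor.memory", "2g")       # memory per executor
        .set("spark.driver.memory", "1g"))        # driver memory (usually set via spark-submit instead)

sc = SparkContext(conf=conf)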

Let’s take a brief look at Spark Standalone, which is the cluster manager that comes with the Spark installation and does not require any additional configuration.

It consists of two important components: workers and a master. I drew the Standalone architecture below. Here, the master runs on any cluster node and allows the workers to connect to the cluster.

Standalone Cluster Architecture

There is also local mode, which doesn’t need anything other than a basic Spark installation (e.g. no need to connect to any cluster). It is used to test a Spark application with a small amount of data. So, there are two different modes: cluster mode and local mode (maybe I will write about that in another article :)).
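As a quick sketch of the difference, the mode shows up in the master URL you pass when creating the SparkContext. The host name, port, and app names below are placeholders, and only one SparkContext can be active per application:

from pyspark import SparkContext

# local mode: everything runs in a single process, with as many threads as logical cores
sc = SparkContext(master="local[*]", appName="local-test")

# standalone cluster mode: connect to a standalone master instead (placeholder host/port)
# sc = SparkContext(master="spark://master-host:7077", appName="cluster-app")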

RDD- Resilient Distributed Datasets

According to the Spark documentation, a Resilient Distributed Dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

Source: IBM

As seen in the image above, the dataset on the left side is split into partitions, which are then stored in the workers’ memory.

Let’s look at how to create RDDs:

  1. Parallelizing an existing collection in the driver program
  2. Referencing a dataset in an external storage system (e.g. HDFS, HBase, etc.)

Parallelizing:

# sc is the SparkContext
dataset = [10, 20, 30]
parallelData = sc.parallelize(dataset)
# or, with an explicit number of partitions:
parallelData = sc.parallelize(dataset, 4)

What does 4 mean here? The dataset will be cut into 4 partitions. (Typically you want 2 to 4 partitions for each CPU in the cluster.) If you don’t pass this argument, Spark decides the number of partitions automatically.
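A quick way to check, assuming the parallelData RDD created above:

parallelData.getNumPartitions()   # 4 -- the partition count we asked for
parallelData.glom().collect()     # elements grouped by partition, e.g. [[], [10], [20], [30]]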

External Datasets:

external_file = sc.textFile("dataset.csv")
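Each element of the resulting RDD is one line of the file (the file name above is just an example). A quick sanity check:

external_file.count()   # number of lines in the file
external_file.first()   # the first line, as a string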

Now, let’s look at RDD operations…🤔

  1. RDD Transformations

Transformations create a new RDD from an existing one. One key thing is that transformations are LAZY: the results are only computed when an action requests a result to be returned to the driver.

There are some common transformations:

  • map
  • filter
  • flatMap
  • groupBy
  • sortBy
  • reduceByKey
  • coalesce
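To see the laziness in action, here is a minimal sketch that chains two of the transformations above. Nothing is computed until an action (next section) asks for a result:

numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# transformations only build up the lineage -- no computation happens here
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

# the action below finally triggers the computation
evens_doubled.collect()   # [4, 8, 12]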

  2. RDD Actions

Actions run the computation and return a value to the driver program.

There are some common actions:

  • reduce
  • collect
  • count
  • foreach

For example, collect() returns all the elements of the dataset as an array to the driver program. This is generally only okay with a small (limited) dataset, because on a large one it can cause an out-of-memory error…
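A few of these actions applied to the numbers RDD from the sketch above; take(n) is usually a safer way to peek at a large dataset than collect():

numbers.count()                      # 6
numbers.reduce(lambda a, b: a + b)   # 21
numbers.take(3)                      # [1, 2, 3]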

Let’s focus on the rest of the Spark architecture. 🎿

SPARK ARCHITECTURE

A Spark Application has two main processes:

Driver Program & Executors

Driver Program

The Driver’s job is to:

  • run the application’s user code → create work → send it to the cluster.

All right, what about Executors?

Executors

Executors run multiple threads so that work for the cluster can be done in parallel. However, they work independently of each other. You can change how many executors there are through configuration.

Spark Context

When the application launches, the Spark Context must be created in the driver.
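In modern Spark you usually create a SparkSession, which builds and wraps the SparkContext for you. A minimal sketch (the app name is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-app").getOrCreate()
sc = spark.sparkContext   # the SparkContext living in the driver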

And now we are ready to learn what happens when the driver creates the work.

These units of work are called “jobs”. The Spark Context that we created in the driver divides each job into tasks that can run in parallel on the executors, but how? The tasks of a job operate on different partitions of the data.

Look at the image: there is a worker node. The executor starts to run a task after the worker node launches it. (We know what an executor does from the section above.)

Note:
Increasing executors & increasing cores = increased cluster parallelism.
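One rough way to see this from the driver, assuming the sc from above:

sc.defaultParallelism   # roughly the total number of cores available, i.e. executors * cores per executor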

Okay, once a task finishes, the executor either puts the result in a new RDD partition or transfers it back to the driver.

I wrote this article with the help of the IBM Big Data with Spark and Hadoop course and the Spark documentation.
Please leave your comments and questions.
And, let’s take a coffee break now :)
