1. Start the Spark cluster by executing sbin/start-all.sh, which starts the master and multiple worker nodes. The master is mainly responsible for managing and monitoring the cluster, while the worker nodes are responsible for running applications. Each worker reports its own status to the master, such as how many CPU cores and how much memory it has; this is done through the heartbeat mechanism.
2. After the master receives a worker's report, it records the worker's resources and sends information back to the worker.
3. The driver submits an application to the Spark cluster. [The communication between the driver and the master is done through Akka actors: the master is an actor in Akka's asynchronous communication model, and so is the driver. The driver asynchronously sends a registration message (RegisterApplication) to the master.] (A minimal driver sketch follows this list.)
4. The master evaluates the application; for example, if 7 GB of memory is needed to complete the job, it distributes the work so that each selected worker node allocates 3.5 GB of memory to execute its tasks. The master monitors and schedules the tasks on all workers as a whole.
5. The worker nodes receive their tasks and start execution: each worker launches the corresponding executor process. Each executor maintains a thread pool that contains multiple task threads.
6. The executor takes tasks from the thread pool to compute the data in each RDD partition, performing transformation operations and action operations.
7. The worker nodes report the computation status back to the driver.
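To make step 3 concrete, the sketch below shows in Java how a driver registers an application with a standalone master. It is only a minimal illustration under assumed names: spark://master-host:7077 is a placeholder master URL and the class name is invented for this example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch: the driver registers an application with the standalone master.
// "spark://master-host:7077" is a placeholder; use your own master URL.
public class DriverRegistrationSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("DriverRegistrationSketch")
                .setMaster("spark://master-host:7077"); // creating the context registers the application
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... define RDDs and operations here; executors on the workers run the resulting tasks ...
        sc.stop();
    }
}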
Creating an RDD from a local parallelized collection
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class JavaLocalSumApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaLocalSumApp");
        JavaSparkContext sc = new JavaSparkContext(conf);
        List<Integer> list = Arrays.asList(1, 3, 4, 5, 6, 7, 8);
        // Create an RDD from a local parallelized collection
        JavaRDD<Integer> listRDD = sc.parallelize(list);
        // Sum the elements
        Integer sum = listRDD.reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        System.out.println(sum);
        sc.stop();
    }
}

// Functional programming in Java requires setting the compiler level to 1.8 (Java 8);
// with a lambda the reduce becomes: listRDD.reduce((v1, v2) -> v1 + v2);

Spark transformation and action operations
RDD (Resilient Distributed Dataset): a collection that supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations. An RDD represents a data set divided into partitions.
An RDD has two kinds of operators:
Transformation: transformations are lazily evaluated. When one RDD is converted into another RDD, the conversion is not executed immediately; Spark only remembers the logical operations on the data set (see the sketch after these definitions).
Action: triggers the submission of a Spark job, and is what actually executes the accumulated transformation operators.
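A small Java sketch of this laziness, assuming an existing JavaSparkContext named sc: the map call below only records the operation, and nothing is computed until the count action runs.

// Assumes an existing JavaSparkContext sc and imports of JavaRDD and java.util.Arrays.
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
// Transformation: map only records the logic to apply; no job runs yet.
JavaRDD<Integer> doubled = nums.map(v -> v * 2);
// Action: count triggers the actual computation of the whole lineage.
long howMany = doubled.count();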
The role of Spark operators
This figure describes how Spark converts RDDs through operators while a job runs. Operators are functions defined on RDDs that transform and operate on the data in an RDD.
Input: while a Spark program runs, data is loaded into Spark from external data space (for example, distributed storage: textFile reads from HDFS, and the parallelize method takes a Scala collection or local data). The data enters the Spark runtime data space, is converted into blocks inside Spark, and is managed by the BlockManager.
Run: once the input data has formed an RDD, it can be operated on by transformation operators such as filter, converting the RDD into a new RDD. An action operator makes Spark submit the job. If the data needs to be reused, it can be cached in memory with the cache operator.
Output: when the program finishes, data is output from the Spark runtime space to distributed storage (for example, saveAsTextFile writes to HDFS) or returned as Scala data or collections (collect returns the data to a Scala collection; count returns a Scala Int), as sketched below.
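Putting the Input / Run / Output stages together, a minimal Java sketch might look like the following; the HDFS paths are placeholders and sc is an assumed, already-created JavaSparkContext.

// Assumes an existing JavaSparkContext sc and the usual Spark Java imports; paths are placeholders.
// Input: read external data into Spark; the resulting blocks are managed by the BlockManager.
JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/input.txt");

// Run: transformations build new RDDs lazily; cache keeps reused data in memory.
JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

// Output: actions trigger the job, returning values to the driver or writing to storage.
long errorCount = errors.count();
errors.saveAsTextFile("hdfs://namenode:9000/output");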
Overview of Transformation and Action operations
Transformation
map(func): Returns a new distributed dataset, composed of each original element after being converted by func function
filter(func): Returns a new dataset composed of the original elements for which the func function returns true
flatMap(func): Similar to map, but each input element will be mapped to 0 to multiple output elements (so the return value of the func function is a Seq, not a single element)
sample(withReplacement, frac, seed): Randomly samples a fraction frac of the data, using the given random seed.
union(otherDataset): Returns a new dataset composed of the union of the original dataset and the argument dataset.
groupByKey([numTasks]): Called on a dataset composed of (K, V) pairs, returning a dataset of (K, Seq[V]) pairs. Note: by default, 8 parallel tasks are used for the grouping. You can pass the optional numTasks parameter to set a different number of tasks according to the amount of data.
reduceByKey(func, [numTasks]): Used on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values with the same key are aggregated together using the given reduce function. As with groupByKey, the number of tasks can be configured with an optional second parameter.
join(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W), returning a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
groupWith(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. This operation is called CoGroup in other frameworks.
cartesian(otherDataset): Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs containing every combination of elements. (A short example using several of these transformations follows this list.)
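As a small illustration of a few of the pair transformations above, here is a hedged Java sketch; it assumes an existing JavaSparkContext sc, the noted imports, and data values invented for the example.

// Assumes an existing JavaSparkContext sc, plus imports of java.util.Arrays,
// org.apache.spark.api.java.JavaPairRDD and scala.Tuple2. The data is invented.
JavaPairRDD<String, Integer> sales = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("apple", 3), new Tuple2<>("pear", 2), new Tuple2<>("apple", 5)));

// reduceByKey: values with the same key are combined with the given function.
JavaPairRDD<String, Integer> totals = sales.reduceByKey((a, b) -> a + b);

// groupByKey: (K, V) pairs become (K, Seq[V]).
JavaPairRDD<String, Iterable<Integer>> grouped = sales.groupByKey();

// join: (K, V) joined with (K, W) gives (K, (V, W)) for matching keys.
JavaPairRDD<String, String> colors = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("apple", "red"), new Tuple2<>("pear", "green")));
JavaPairRDD<String, Tuple2<Integer, String>> joined = totals.join(colors);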
Actions
reduce(func): Aggregates all elements in the dataset using the function func. The func function takes 2 arguments and returns one value. The function must be associative so that it can be executed correctly in parallel.
collect(): In the driver program, returns all elements of the dataset as an array. This is usually useful after a filter or some other operation has produced a small enough subset of the data; calling collect directly on the entire RDD is likely to make the driver program OOM.
count(): Returns the number of elements in the dataset
take(n): Returns an array consisting of the first n elements of the dataset. Note that this operation is currently not executed in parallel on multiple nodes; instead, the machine running the driver program computes all the elements by itself (the memory pressure on the gateway machine will increase, so use it with caution).
first(): Returns the first element of the dataset (similar to take(1))
saveAsTextFile(path): Save the elements of the dataset in the form of a textfile to the local file system, hdfs or any other file system supported by Hadoop. Spark will call the toString method of each element and convert it into a line of text in the file.
saveAsSequenceFile(path): Save the elements of the dataset in the format of sequencefile to the specified directory, local system, hdfs or any other file system supported by Hadoop. The elements of RDD must be composed of key-value pairs, and they all implement Hadoop's Writable interface, or they can be converted into Writable implicitly (Spark includes conversions of basic types, such as Int, Double, String, etc.)
foreach(func): Runs the function func on each element of the dataset. This is usually used to update an accumulator variable or to interact with an external storage system. (A short example exercising several of these actions follows this list.)
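A hedged Java sketch exercising several of the actions above; it assumes an existing JavaSparkContext sc, and the HDFS output path is a placeholder.

// Assumes an existing JavaSparkContext sc and the usual Spark Java imports.
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

Integer sum = nums.reduce((v1, v2) -> v1 + v2);    // aggregate with an associative function
List<Integer> all = nums.collect();                // bring every element back to the driver
long n = nums.count();                             // number of elements
List<Integer> firstTwo = nums.take(2);             // first 2 elements, computed on the driver side
Integer head = nums.first();                       // similar to take(1)
nums.foreach(v -> System.out.println(v));          // runs on the executors, not on the driver
nums.saveAsTextFile("hdfs://namenode:9000/nums");  // placeholder path; one text line per element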
WordCount execution process
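Since this section only names the WordCount execution flow, here is a minimal Java sketch of it under stated assumptions: the input and output paths are placeholders, the class name is invented, and the flatMap lambda follows the Spark 2.x Java API, where the function returns an Iterator.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaWordCountApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaWordCountApp");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Input: read lines from HDFS (placeholder path)
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/input.txt");

        // Transformations: split lines into words, map each word to (word, 1), sum by key
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Action: triggers the job and writes the result (placeholder path)
        counts.saveAsTextFile("hdfs://namenode:9000/wordcount-output");

        sc.stop();
    }
}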
Summary
The above covers the principles of Spark's scheduling architecture. I hope it is helpful to everyone. Interested readers can continue to refer to other related topics on this site. If anything is lacking, please leave a message to point it out. Thank you for your support of this site!