分享兴趣,传播快乐,增长见闻,留下美好!
亲爱的您,这里是LearningYard新学苑。
今天小编为大家带来文章
“小h漫谈(15):Spark运行基本流程”
欢迎您的访问。
Share interests, spread happiness, increase knowledge, and leave beautiful memories behind!
Dear reader, this is LearningYard Academy.
Today the editor brings you the article
"Xiao H's Ramblings (15): The Basic Operation Process of Spark".
We welcome your visit.
一、思维导图(Mind Map)
二、精读内容(Intensive Reading Content)
如下图所示,Spark 的基本运行流程如下。
As shown in the figure below, the basic operation process of Spark is as follows.
(1)当一个 Spark 应用被提交时,首先需要为这个应用构建基本的运行环境,即由任务控制节点(Driver)创建一个 SparkContext 对象,由 SparkContext 负责与资源管理器(Cluster Manager)的通信以及进行资源的申请、任务的分配和监控等。
When a Spark application is submitted, a basic runtime environment must first be built for it: the Task Control Node (Driver) creates a SparkContext object, and the SparkContext is responsible for communicating with the Resource Manager (Cluster Manager), requesting resources, and allocating and monitoring tasks.
SparkContext会向资源管理器注册并申请运行 Executor 的资源,SparkContext 可以看成是应用程序连接集群的通道。
The SparkContext registers with the Resource Manager and requests resources to run Executors; it can be regarded as the channel through which the application connects to the cluster.
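For illustration, here is a minimal Scala sketch of step (1); the application name and the standalone master URL spark://master:7077 are made-up examples:

    import org.apache.spark.{SparkConf, SparkContext}

    // The Driver builds the application's runtime environment by creating a SparkContext.
    val conf = new SparkConf()
      .setAppName("BasicFlowDemo")           // example application name shown in the cluster UI
      .setMaster("spark://master:7077")      // example master URL: which Cluster Manager to contact
    val sc = new SparkContext(conf)          // registers with the Cluster Manager and requests Executor resources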
(2)资源管理器为 Executor 分配资源,并启动 Executor 进程,Executor 运行情况将随着“心跳” 发送到资源管理器上。
The Resource Manager allocates resources for the Executors and starts the Executor processes. The running status of each Executor is reported to the Resource Manager via "heartbeats".
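The resources granted to Executors are usually controlled through configuration set on the SparkConf before the SparkContext is created. A small sketch, assuming a YARN-style deployment (spark.executor.instances applies to YARN; other cluster managers size Executors differently):

    val execConf = new SparkConf()
      .set("spark.executor.instances", "4")           // number of Executor processes to request (YARN)
      .set("spark.executor.memory", "2g")             // memory per Executor
      .set("spark.executor.cores", "2")               // cores per Executor, i.e. concurrent task threads
      .set("spark.executor.heartbeatInterval", "10s") // how often each Executor reports its status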
(3)SparkContext 根据 RDD 的依赖关系构建 DAG 图,并将 DAG 图提交给 DAG 调度器 (DAGScheduler)进行解析,将 DAG 图分解成多个“阶段”(每个阶段都是一个任务集),并且计算出各个阶段之间的依赖关系,然后把一个个“任务集”提交到底层的任务调度器(TaskScheduler)进行处理。
The SparkContext builds a DAG from the dependency relationships among RDDs and submits it to the DAG Scheduler (DAGScheduler) for parsing. The DAG is decomposed into multiple "stages" (each stage is a task set), the dependencies between stages are computed, and each "task set" is then submitted to the underlying Task Scheduler (TaskScheduler) for processing.
Executor 向 SparkContext 申请任务,任务调度器将任务分发给 Executor 运行,同时,SparkContext 将应用程序代码发放给 Executor。
Executors request tasks from the SparkContext, and the Task Scheduler dispatches tasks to the Executors for execution; at the same time, the SparkContext ships the application code to the Executors.
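A classic word count illustrates how the DAG is split into stages. In this sketch, sc is the SparkContext created earlier and the HDFS path is hypothetical; the shuffle introduced by reduceByKey marks the stage boundary:

    val counts = sc.textFile("hdfs://namenode:9000/input.txt")  // hypothetical input path
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))      // narrow dependencies: these stay in the same stage
      .reduceByKey(_ + _)          // wide (shuffle) dependency: the DAGScheduler starts a new stage here
    // Each stage becomes a task set that the TaskScheduler dispatches to Executors.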
(4)任务在 Executor 上运行,把执行结果反馈给任务调度器,然后反馈给 DAG 调度器,运行完毕后写入数据并释放所有资源。
Tasks run on the Executors, and their execution results are fed back to the Task Scheduler and then to the DAG Scheduler. When execution finishes, the results are written out and all resources are released.
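Continuing the same sketch, an action triggers the actual job, and stopping the SparkContext releases the resources held by the application (the output path is again hypothetical):

    counts.saveAsTextFile("hdfs://namenode:9000/output")  // the action runs the job and writes the results
    sc.stop()                                             // releases the Executors and other resources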
总体而言,Spark 运行架构具有以下特点。
Overall, the Spark operating architecture has the following characteristics:
(1)每个应用都有自己专属的 Executor 进程,并且该进程在应用运行期间一直驻留。Executor进程以多线程的方式运行任务,减少了多进程任务频繁的启动开销,使任务执行变得非常高效和可靠。
Each application has its own dedicated Executor process, which remains resident throughout the application's runtime. The Executor process runs tasks in a multithreaded manner, reducing the frequent startup overhead of multi-process tasks and making task execution highly efficient and reliable.
(2)Spark 运行过程与资源管理器无关,只要能够获取 Executor 进程并保持通信即可。
The Spark running process is independent of the specific resource manager: it only needs to obtain Executor processes and keep communicating with them.
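As a rough illustration of this point, the same application code can run under different cluster managers by changing only the master URL (the host names below are examples):

    new SparkConf().setMaster("local[*]")             // local mode, for development
    new SparkConf().setMaster("spark://master:7077")  // Spark standalone cluster
    new SparkConf().setMaster("yarn")                 // Hadoop YARN (cluster details come from the Hadoop config)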
(3)Executor 上有一个 BlockManager 存储模块,类似于键值存储系统(把内存和磁盘共同作为存储设备)。
There is a BlockManager storage module on the Executor, which is similar to a key-value storage system (using both memory and disk as storage devices).
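For example, when an RDD is persisted with a memory-and-disk storage level, its cached blocks are managed by each Executor's BlockManager (counts refers to the word-count RDD from the earlier sketch):

    import org.apache.spark.storage.StorageLevel

    val cached = counts.persist(StorageLevel.MEMORY_AND_DISK)  // blocks spill to disk when memory is insufficient
    cached.count()                                             // the first action materializes the cached blocks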
(4)任务采用了数据本地性和推测执行等优化机制。
The tasks adopt optimization mechanisms such as data locality and speculative execution.
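Both mechanisms can be tuned through configuration. A small sketch of the standard settings (spark.locality.wait defaults to 3s; speculative execution is off by default and is enabled here only for illustration):

    val tuningConf = new SparkConf()
      .set("spark.locality.wait", "3s")  // how long to wait for a data-local slot before relaxing locality
      .set("spark.speculation", "true")  // re-launch suspiciously slow tasks speculatively on other Executors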
今天的分享就到这里了
如果您对今天的文章有独特的想法
欢迎给我们留言
让我们相约明天
祝您今天过得开心快乐!
That's all for today's sharing.
If you have unique thoughts about today's article,
please leave us a message.
Let's meet again tomorrow.
Wishing you a happy day!
文案|小h
排版|小h
审核|ls
参考资料:
文字:《Spark编程基础》
图片:CSDN博客
本文由LearningYard新学苑整理并发出,如有侵权请在后台留言!
来源:LearningYard学苑