EvalPlanner: A Two-Stage Plan-Execute Framework for LLM Evaluation

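Large language model (LLM) evaluation systems that generate Chain-of-Thought (CoT) sequences need to capture the reasoning steps of the evaluation process systematically. However, the lack of human-annotated CoT training data and the limitations of predefined evaluation prompts on complex tasks make it hard to build high-quality LLM evaluators. In addition, manually tuning evaluation instructions shows clear limitations when tasks are diverse and complex.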

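To address these challenges, the research team proposed EvalPlanner [1], an LLM evaluation algorithm. It follows a two-stage plan-execute paradigm: it first generates an unconstrained evaluation plan, then executes that plan and reaches a final verdict. This structure makes the evaluation process more systematic and reliable.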

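The EvalPlanner architecture comprises three core components, as shown in the figure below:

Specifically, the system contains the following key elements: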

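a) Evaluation plan (z)

Based on the input instruction x, the system lays out a concrete strategy for evaluating the responses.

The plan design emphasizes flexibility and generality.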

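b) Plan execution module

Executes the steps of the evaluation plan in order.

Analyzes the target responses a and b and produces a detailed evaluation result.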

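c) Final verdict (y)

Under the judge LLM (with parameters θ), the plan z and the execution e are treated as latent variables.

The verdict generation process can be expressed as: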
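One way to write this out, consistent with the description above, is to marginalize the verdict over the latent plan and execution. The factorization below is a reconstruction from the surrounding definitions (the plan depends only on the instruction x; the execution depends on the plan, the instruction, and both responses), not a formula copied from the source. Using P and E for the sets of candidate plans and executions is an assumption about notation.

```latex
p_\theta(y \mid x, a, b)
  = \sum_{z \in P} \sum_{e \in E}
    p_\theta(y \mid e, z, x, a, b)\;
    p_\theta(e \mid z, x, a, b)\;
    p_\theta(z \mid x)
```

Read right to left, the three factors correspond to the three components: generate a plan from the instruction, execute it on the response pair, then produce the verdict.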

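The overall workflow of the system is shown in the figure below:

The main steps are:

Sample multiple evaluation plans z from the distribution P.

For each plan, sample multiple executions e from the distribution E.

Optimize the plans and executions through a self-training loop.

At test time, the model generates a structured CoT output: ỹ = (z̃, ẽ, ỹ).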

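The system builds its positive and negative examples from two core task domains:

General instruction-following tasks: contrast samples are generated by injecting noise into the original instruction. The response to the original instruction serves as the positive example, and the response to the noisy instruction serves as the negative example.

Mathematical reasoning tasks: multiple candidate responses are sampled. Correct solutions serve as positive examples, and incorrect solutions serve as negative examples.

Evaluation plan generation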

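The system uses a generic, unconstrained plan-generation prompt template. Conditioned only on the input instruction, the template queries an instruction-tuned LLM to obtain an initial plan. The core of the prompt template is as follows: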

We want to evaluate the quality of the responses provided by AI assistants to the user question displayed below. For that, your task is to help us build an evaluation plan that can then be executed to assess the response quality. Whenever appropriate, you can choose to also include a step-by-step reference answer as part of the evaluation plan. Enclose your evaluation plan between the tags “[Start of Evaluation Plan]” and “[End of Evaluation Plan]”.

[User Question]
{instruction}

Plan execution generation

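The plan execution stage uses the seed model: given the instruction and the response pair, it reasons over the generated plan and produces a verdict.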

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better.

[[User Question]]
{instruction}

[The Start of Assistant A’s Answer]
{response A}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{response B}
[The End of Assistant B’s Answer]
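As a rough illustration of how the outputs of the two prompts might be consumed, the sketch below extracts the plan from between the plan tags and parses the final verdict marker. The function names `extract_plan` and `parse_verdict` are hypothetical and do not come from the EvalPlanner paper; only the tag strings and the `[[A]]` / `[[B]]` verdict format are taken from the templates quoted above. Taking the last verdict marker when several appear is also an assumption.

```python
import re
from typing import Optional

# Tag strings as they appear in the plan-generation prompt.
PLAN_START = "[Start of Evaluation Plan]"
PLAN_END = "[End of Evaluation Plan]"


def extract_plan(model_output: str) -> Optional[str]:
    """Return the text between the plan tags, or None if either tag is missing."""
    start = model_output.find(PLAN_START)
    if start == -1:
        return None
    body_start = start + len(PLAN_START)
    end = model_output.find(PLAN_END, body_start)
    if end == -1:
        return None
    return model_output[body_start:end].strip()


def parse_verdict(judgment: str) -> Optional[str]:
    """Return "A" or "B" from the last [[A]]/[[B]] marker, or None if absent."""
    matches = re.findall(r"\[\[([AB])\]\]", judgment)
    return matches[-1] if matches else None
```

Outputs for which either function returns `None` could simply be discarded before any further processing; that filtering policy is a design choice of this sketch rather than something the source specifies.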

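This decoupled design has two main advantages:

It ensures that the execution strictly follows the predetermined plan.

Sampling multiple executions for the same plan increases the diversity of the evaluation data.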

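For each input instruction:

Sample |P| plans.

For each plan, sample |E| executions.

Consider both orderings of the response pair, (a, b) and (b, a), which yields 2×|P|×|E| CoT sequences in total.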
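The counting above can be sketched as a small enumeration loop. This is a minimal sketch, not the authors' implementation: `build_cots`, `CoT`, and the two sampler callables are hypothetical names, and the samplers stand in for calls to the seed model. It also assumes the same |P| plans are reused for both response orderings, which is consistent with the plan depending only on the instruction.

```python
from typing import Callable, List, NamedTuple, Tuple


class CoT(NamedTuple):
    plan: str
    execution: str
    order: Tuple[str, str]  # (response shown first, response shown second)


def build_cots(
    instruction: str,
    resp_a: str,
    resp_b: str,
    sample_plan: Callable[[str], str],
    sample_execution: Callable[[str, str, str, str], str],
    num_plans: int,
    num_execs: int,
) -> List[CoT]:
    """Enumerate 2 * num_plans * num_execs CoT sequences for one instruction."""
    # Plans are conditioned on the instruction only, so sample them once.
    plans = [sample_plan(instruction) for _ in range(num_plans)]
    cots: List[CoT] = []
    for plan in plans:
        # Both presentation orders of the response pair are covered.
        for first, second in ((resp_a, resp_b), (resp_b, resp_a)):
            for _ in range(num_execs):
                execution = sample_execution(instruction, plan, first, second)
                cots.append(CoT(plan, execution, (first, second)))
    return cots
```

With `num_plans=3` and `num_execs=4`, for example, the function returns 24 sequences, half of them for each response ordering.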

The system is optimized with a self-training loop, which consists of the following main steps: