Code Models Evolve on Their Own and Beat GPT-4o Distillation! UIUC, Berkeley and Others Propose a Self-Alignment Method

Summary: Experiments show that instruction-tuning CodeQwen1.5-7B with SelfCodeAlign reaches 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct, a model with ten times as many parameters.

Editor: alan

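[新智元 overview] Code models can evolve on their own: instruction-tuned on data they generate themselves, they outperform direct distillation from GPT-4o!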

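As a foundation for intelligence, an LLM can give rise to many different capabilities.

Coding is one of them: program completion, commenting, optimization, bug fixing, testing, and more.

To unlock the full potential of an LLM, instruction tuning is a crucial step.

Today, high-quality instruction data comes mainly from two sources: human annotation and distillation.

The former is expensive, and the latter comes with restrictions, such as the usage terms of proprietary teacher models. So researchers have started looking for another route.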

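Recently, researchers from UIUC, Berkeley, and other institutions proposed SelfCodeAlign.

The work shows for the first time that a strong code model can be obtained through self-alignment, with no human annotation or distillation, and that it can outperform distillation-based approaches!

Paper: https://arxiv.org/pdf/2410.24198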

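SelfCodeAlign uses the same base model for inference throughout the data generation process, which has three steps:

First, it extracts diverse coding concepts from high-quality seed snippets and uses them to generate new tasks.

Next, it samples multiple responses for each task, pairs each response with test cases, and validates it in a sandbox environment.

Finally, it selects the examples that pass validation for instruction tuning.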
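
The second and third steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `passes_tests` and `select_validated` are hypothetical names, a plain subprocess stands in for the sandbox, and picking one passing response at random is an assumption.

```python
import random
import subprocess
import sys


def passes_tests(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Run solution + tests in a separate interpreter; exit code 0 means pass.

    A plain subprocess is NOT a real sandbox; it only stands in for one here.
    """
    program = solution + "\n\n" + tests
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat hangs and infinite loops as failures
    return result.returncode == 0


def select_validated(task: str, candidates: list[tuple[str, str]], rng: random.Random):
    """Keep candidates whose own tests pass, then pick one (assumed: at random)."""
    passing = [(sol, tests) for sol, tests in candidates if passes_tests(sol, tests)]
    if not passing:
        return None  # the task is dropped when no response validates
    solution, _ = rng.choice(passing)
    return {"instruction": task, "response": solution}
```

Each candidate is a (response, tests) pair sampled from the same base model; only tasks with at least one passing pair contribute an instruction-tuning example.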

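SelfCodeAlign is the first fully transparent pipeline that self-aligns a base code model using purely self-generated instruction data.

Experiments show that instruction-tuning CodeQwen1.5-7B with SelfCodeAlign reaches 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct, a model with ten times as many parameters.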
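
For context on the metric: pass@1 is the share of benchmark problems solved by a single generated sample. The sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation; whether the numbers quoted here were computed with exactly this procedure is not stated in this article, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; `results` holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a score of 67.1 means roughly two thirds of the problems are solved on a single attempt.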

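Across all three benchmark categories (code generation, data science programming, and code editing), SelfCodeAlign beats OctoPack, the previous state-of-the-art method for instruction tuning without human annotation or distillation.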

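In addition, on HumanEval+, SelfCodeAlign outperforms distillation methods based on GPT-3.5-Turbo, including OSS-Instruct (61.6) and Evol-Instruct (59.1), and even beats direct output distillation from GPT-4o (65.9)!

This suggests that learning from data aligned with the model's own distribution may work better than relying on a powerful teacher model.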

SelfCodeAlign works for LLMs of various sizes (from 3B to 33B). StarCoder2-Instruct, for example, was built with it (base model: StarCoder2-15B).

Self-Aligned Code Generation

The figure below uses the instruction tuning of StarCoder2-15B as an example to illustrate the SelfCodeAlign workflow:

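SelfCodeAlign first collects a set of seed code snippets from The Stack V1.

At this stage, it is essential that the seed snippets are diverse and high-quality, because they serve as the starting point for generating instructions and responses.

To collect the seeds, the researchers extract every Python function with a docstring from The Stack V1 and then apply a series of filtering rules to ensure quality.

By running the Pyright type checker, removing benchmark items, filtering out functions with poor documentation, and removing near-duplicate functions, they narrow 5M functions down to 250k Python functions.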

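Once the seed functions are collected, the pipeline runs Self-OSS-Instruct, a modification of OSS-Instruct adapted for self-alignment, to generate diverse instructions.

Specifically, it uses in-context learning so that the base model itself generates instructions from a given seed code snippet.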
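
In-context learning here means placing worked demonstrations in front of the new seed so that a base model continues the pattern. A minimal sketch of assembling such a prompt is shown below; the `Demo` class, the section markers, and `build_few_shot_prompt` are hypothetical and do not reproduce the paper's actual prompt format.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One demonstration: a seed snippet and the instruction derived from it."""
    snippet: str
    instruction: str


def build_few_shot_prompt(system: str, demos: list[Demo], seed: str) -> str:
    """Lay out the demos, then leave the final instruction slot open."""
    parts = [f"### System\n{system}"]
    for i, demo in enumerate(demos, start=1):
        parts.append(
            f"### Example {i}\n[Snippet]\n{demo.snippet}\n"
            f"[Instruction]\n{demo.instruction}"
        )
    # The base model is expected to complete the text after this last marker.
    parts.append(f"### Example {len(demos) + 1}\n[Snippet]\n{seed}\n[Instruction]\n")
    return "\n\n".join(parts)
```

Because the prompt ends with an empty instruction slot, whatever the model generates next can be read off as the new instruction for that seed.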

### System: I->R
You are an extremely intelligent AI coding assistant. Please provide an accurate and reliable response to each user instruction. After delivering your response, verify its consistency and correctness by writing a series of executable tests.

### System: C->I
Create a series of independent coding tasks that are original, distinct, diverse, and high-quality, fostering logical thinking. Each task must adhere to specified properties:
- category: the type of task (e.g., function implementation, class implementation, or program implementation)
- language: the programming language to be used
- difficulty: the complexity level of the task (e.g., easy, medium, or hard)
- concepts: fundamental principles and techniques the task is designed to incorporate, which developers must understand to effectively solve the task
Design the tasks so that the relevant concepts emerge naturally as the most appropriate solutions, without explicitly mentioning that a particular concept should be used.

The authors used 21 carefully designed examples to teach the model how to work: