Fully Transparent Self-Alignment for Code Generation

Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, and Lingming Zhang
Neural Information Processing Systems (NeurIPS), 2024

Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of Large Language Models (LLMs) to follow human instructions. For programming tasks, most models are fine-tuned either with costly human-annotated instruction-response pairs or with pairs generated by large, proprietary LLMs, whose use for this purpose may not be permitted. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign uses the same base model for inference throughout the data-generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiment, we apply SelfCodeAlign to CodeQwen1.5-7B, yielding a dataset of 74k instruction-response pairs. Fine-tuning CodeQwen1.5-7B on this dataset produces SelfCodeAlign-CQ-7B, which achieves a pass@1 score of 67.1 on HumanEval+, outperforming CodeLlama-70B-Instruct despite being ten times smaller. Across all evaluated benchmarks, SelfCodeAlign-CQ-7B consistently outperforms CodeQwen1.5-7B trained with OctoPack, the prior state-of-the-art instruction-tuning method without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that base models can benefit more from alignment with their own data distribution. We also validate the effectiveness of each component in our pipeline, demonstrating that SelfCodeAlign outperforms state-of-the-art GPT-3.5-based distillation methods such as OSS-Instruct and Evol-Instruct. Overall, SelfCodeAlign shows for the first time that a strong instruction-tuned code LLM can result from self-alignment rather than distillation. We plan to open-source all code, data, and models.
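For readers who want the pipeline at a glance, here is a minimal Python sketch of the four stages (concept extraction, task generation, response sampling with tests, sandbox validation). All helper names here (base_model_generate, extract_concepts, generate_task, passes_in_sandbox, the "# --- tests ---" delimiter) are hypothetical illustrations, not the paper's actual implementation; the subprocess call merely stands in for a proper sandbox.

import subprocess
import tempfile

def base_model_generate(prompt: str, n_samples: int = 1) -> list[str]:
    """Stub for the base LLM (e.g., CodeQwen1.5-7B). SelfCodeAlign uses the
    same base model for every generation step; plug in your own inference."""
    raise NotImplementedError("replace with a call to your base model")

def extract_concepts(seed_snippet: str) -> list[str]:
    """Step 1 (illustrative): ask the model to name concepts in a seed snippet."""
    prompt = f"List the coding concepts used in this snippet:\n{seed_snippet}"
    return base_model_generate(prompt)[0].splitlines()

def generate_task(concepts: list[str]) -> str:
    """Step 2 (illustrative): turn extracted concepts into a new instruction."""
    prompt = "Write a self-contained coding task exercising: " + ", ".join(concepts)
    return base_model_generate(prompt)[0]

def passes_in_sandbox(program: str, tests: str, timeout: int = 10) -> bool:
    """Step 4 (illustrative): run a response with its tests; a plain subprocess
    stands in for the paper's sandbox (a real setup would add isolation)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + tests)
        path = f.name
    try:
        return subprocess.run(["python", path], capture_output=True,
                              timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False

def self_align(seed_snippets: list[str], n_samples: int = 10) -> list[dict]:
    """Concepts -> tasks -> sampled responses with tests -> validated pairs."""
    dataset = []
    for snippet in seed_snippets:
        task = generate_task(extract_concepts(snippet))
        # Step 3: sample multiple candidate responses, each paired with tests.
        prompt = f"{task}\nAlso write test cases after a '# --- tests ---' line."
        for candidate in base_model_generate(prompt, n_samples=n_samples):
            response, _, tests = candidate.partition("# --- tests ---")
            if tests and passes_in_sandbox(response, tests):
                # Keep only execution-validated pairs for instruction tuning.
                dataset.append({"instruction": task, "response": response})
                break
    return dataset

The key design point the sketch reflects is that no teacher model appears anywhere: the same base model generates concepts, tasks, responses, and tests, and execution feedback, rather than a stronger LLM, filters the final instruction-tuning data.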

PDF

@inproceedings{wei:starcoder-self-alignment,
  title="Fully Transparent Self-Alignment for Code Generation",
  author="Yuxiang Wei and Federico Cassano and Jiawei Liu and Yifeng Ding and Naman Jain and Zachary Mueller and Harm de~Vries and Leandro Von~Werra and Arjun Guha and Lingming Zhang",
  booktitle="Neural Information Processing Systems (NeurIPS)",
  year=2024,
  url={https://openreview.net/forum?id=xXRnUU7xTL}
}