Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Anton Lozhkov, Carolyn Jane Anderson and Arjun Guha
Conference on Language Modelling (COLM), 2024

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language instructions, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is instructed to update a block of code provided in a prompt. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, ask for a different kind of solution, or request any of many other common code edits.

We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is 8.8% better than the best open model at editing code.

We also introduce a new, carefully curated, permissively licensed training set of code edits coupled with natural language instructions. Using this training set, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities.

@inproceedings{cassano:canitedit,
  title = {Can {{It Edit}}? {{Evaluating}} the {{Ability}} of {{Large Language Models}} to {{Follow Code Editing Instructions}}},
  booktitle = {Conference on {{Language Modelling}} ({{COLM}})},
  author = {Cassano, Federico and Li, Luisa and Sethi, Akul and Shinn, Noah and {Brennan-Jones}, Abby and Lozhkov, Anton and Anderson, Carolyn Jane and Guha, Arjun},
  year = {2024}
}