StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code

Hannah McLean Babe, Sydney Nguyen, Yangtian Zi, Arjun Guha, Molly Q Feldman, and Carolyn Jane Anderson
Findings of the Association for Computational Linguistics (ACL Findings), 2024

Code LLMs have the potential to make it easier for non-experts to understand and write code. However, current Code LLM benchmarks rely on a single expert-written prompt per problem, making it hard to generalize their success to non-expert users. In this paper, we present a new natural-language-to-code benchmark of prompts written by a key population of non-experts: beginning programmers. StudentEval contains 1,749 prompts written by 80 students who have completed only one introductory Python course. Because StudentEval includes numerous non-expert prompts describing the same problem, it enables exploration of key factors in prompt success. We use StudentEval to evaluate 12 Code LLMs and find that it discriminates model performance better than existing benchmarks. Our analysis of student prompting strategies reveals that nondeterministic LLM sampling can mislead students about the quality of their descriptions, a finding with key implications for Code LLMs in education.

Dataset

PDF available on arXiv
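
To browse the prompts yourself, the minimal sketch below loads the benchmark with the Hugging Face datasets library and prints its splits, columns, and one example. The Hub ID wellesley-easel/StudentEval is an assumption here; follow the Dataset link above for the authoritative location and column schema.

  # Minimal sketch: inspect StudentEval via the Hugging Face `datasets` library.
  # The dataset ID below is assumed; see the Dataset link above for the exact location.
  from datasets import load_dataset

  DATASET_ID = "wellesley-easel/StudentEval"  # assumed Hub ID

  def main():
      ds = load_dataset(DATASET_ID)
      # List the available splits with their sizes and column names.
      for split_name, split in ds.items():
          print(f"{split_name}: {len(split)} rows, columns = {split.column_names}")
      # Peek at the first record of the first split (one student-written prompt).
      first_split = next(iter(ds.values()))
      print(first_split[0])

  if __name__ == "__main__":
      main()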

@inproceedings{babe:studenteval,
  title = {{StudentEval}: A Benchmark of Student-Written Prompts for Large Language Models of Code},
  author = {Babe, Hannah McLean and Nguyen, Sydney and Zi, Yangtian and Guha, Arjun and Feldman, Molly Q and Anderson, Carolyn Jane},
  booktitle = {Findings of the Association for Computational Linguistics},
  year = {2024}
}