Progress in NLP methodology over the last thirty years has been driven by benchmarks, from the Penn Treebank to GLUE. Benchmarks are useful because they provide a standard task, dataset, and means of evaluation that any researcher can use to quickly and easily demonstrate the value of their method. However, in the current age of LLMs, I argue that benchmarking is becoming increasingly obsolete. Beyond challenges such as data contamination, the scientific validity of "prompt engineering", and the use of closed-source APIs, each of which is critical in its own right, there are fundamental issues with how real-world tasks are formulated into benchmarks that can rank LLMs by the much-desired "single score". I highlight these issues using some of my lab's recent work on tasks such as book-length summarization, long-form question answering, and literary translation. I then describe a more human-centered approach to evaluation that relies on human-LLM collaboration, which I believe is more in line with how these models will be used in practice.
Mohit Iyyer is an associate professor in computer science at the University of Massachusetts Amherst, with a primary research interest in natural language processing. He is the recipient of best paper awards at NAACL (2016, 2018), an outstanding paper award at EACL 2023, and a best demo award at NeurIPS 2015, and he received the 2022 Samsung AI Researcher of the Year award. He received his PhD in computer science from the University of Maryland, College Park in 2017 and spent the following year as a researcher at the Allen Institute for Artificial Intelligence.