The past few years have seen remarkable advances in NLP, as evidenced both by continued rapid gains on benchmark tasks and by the increasing prominence of real NLP systems in the wild. In assessing such progress, however, it is important to ask not only which system achieves the best performance, but also how it achieves that level of performance, how much we can trust the evaluation, and what the consequences of deploying such a system might be.
In this talk, I will present several recent pieces of work related to these questions. In the first part, I will focus on the empirical evaluation of NLP systems and motivate the idea of resource-aware NLP. I will then present an analysis of the state of evaluation in NLP research, with a focus on statistical power, and share some proposals for improved reporting of experimental results. In the second part, I will present a more philosophical analysis of the idea of fairness and raise some broader difficulties in thinking about how to create socially beneficial NLP systems.
Dallas Card is a postdoctoral scholar in the NLP group at Stanford University. His research focuses on what we can learn about people and society from text, and on improving scientific practice in machine learning and NLP.