Benchmark Transparency: Measuring the Impact of Data on Evaluation

Venelin Kovatchev, Matthew Lease

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper we present exploratory research on quantifying the impact that data distribution has on the performance and evaluation of NLP models. We propose an automated framework that measures the data point distribution across 6 different dimensions: ambiguity, difficulty, discriminability, length, noise, and perplexity. We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance. We experiment on 2 different datasets (SQUAD and MNLI) and test a total of 135 different models (125 on SQUAD and 10 on MNLI). We demonstrate that without explicit control of the data distribution, standard evaluation frameworks are inconsistent and unreliable. We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric. In a second set of experiments, we demonstrate that the impact of data on evaluation is not just observable, but also predictable. We propose to use benchmark transparency as a method for comparing datasets and quantifying the similarity between them. We find that the “dataset similarity vector” can be used to predict how well a model generalizes out of distribution.
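The abstract's core mechanism is disproportional stratified sampling over a data dimension (e.g. difficulty) to test how the sampled distribution changes absolute scores (Acc/F1) and the relative ranking of models. The following is a minimal, hypothetical Python sketch of that idea only; it is not the authors' framework, and the difficulty scores, skill values, and the stratified_sample helper are illustrative assumptions.

# A minimal, hypothetical sketch (not the authors' released code): disproportional
# stratified sampling over one assumed data dimension ("difficulty") and its effect
# on absolute accuracy and on the relative ranking of simulated models.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_examples, n_models = 6000, 10

# Assumed per-example difficulty scores and simulated per-model correctness:
# higher-skill models and easier examples yield correct predictions more often.
difficulty = rng.random(n_examples)
skill = np.linspace(0.55, 0.90, n_models)
correct = rng.random((n_models, n_examples)) < (skill[:, None] - 0.4 * difficulty)

def stratified_sample(scores, proportions, n_total):
    """Draw example indices with caller-chosen (disproportional) stratum weights."""
    n_strata = len(proportions)
    edges = np.quantile(scores, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.digitize(scores, edges[1:-1]), 0, n_strata - 1)
    chosen = []
    for s, p in enumerate(proportions):
        pool = np.flatnonzero(strata == s)
        chosen.append(rng.choice(pool, size=int(p * n_total), replace=False))
    return np.concatenate(chosen)

# Uniform strata vs. a sample that over-represents the hardest stratum.
uniform_idx = stratified_sample(difficulty, [1/3, 1/3, 1/3], 2000)
hard_idx = stratified_sample(difficulty, [0.1, 0.2, 0.7], 2000)

acc_uniform = correct[:, uniform_idx].mean(axis=1)
acc_hard = correct[:, hard_idx].mean(axis=1)

# Absolute scores shift with the sampled distribution; the model ranking can too.
rho, _ = spearmanr(acc_uniform, acc_hard)
print("mean accuracy, uniform vs hard-heavy:", acc_uniform.mean(), acc_hard.mean())
print("Spearman rank correlation between the two evaluations:", rho)

In this toy setup, comparing acc_uniform with acc_hard mirrors the paper's distinction between absolute (Acc/F1) and relative (Rank) effects: the accuracies drop under the hard-heavy sample, while the rank correlation indicates how stable the model ordering remains.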
Original language: English
Title of host publication: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Publisher: Association for Computational Linguistics, ACL
Pages: 1536–1551
Number of pages: 16
ISBN (Electronic): 9798891761148
DOIs
Publication status: Published - 16 Jun 2024
Event: 2024 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2024 - Mexico City, Mexico
Duration: 16 Jun 2024 – 21 Jun 2024

Conference

Conference: 2024 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2024
Abbreviated title: NAACL 2024
Country/Territory: Mexico
City: Mexico City
Period: 16/06/24 – 21/06/24
