Abstract
Work on bias in pretrained language models (PLMs) has focused on bias evaluation and mitigation, leaving the question of bias attribution and explainability largely unaddressed. We propose a novel metric, the bias attribution score, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it to multilingual PLMs, including models from Southeast Asia that have not yet been thoroughly examined in the bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses further reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping, among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from their pretraining data and where they should be used with greater caution.
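The abstract does not give the paper's actual formulation of the bias attribution score. As a purely illustrative sketch of what a token-level, information-theoretic attribution can look like, the snippet below computes a leave-one-out, PMI-style contribution: how much keeping each context token raises a masked LM's log-probability of a chosen target completion. The model name, the example sentence, and the scoring scheme are assumptions for illustration only, not the authors' metric.

```python
# Illustrative sketch only -- NOT the paper's bias attribution score.
# Assumes a masked LM ("bert-base-multilingual-cased" is an arbitrary choice)
# and a leave-one-out, PMI-style attribution: the drop in log-probability of a
# target word at the [MASK] slot when one context token is ablated.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-multilingual-cased"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def target_log_prob(token_ids, mask_pos, target_id):
    """Log-probability of `target_id` at the [MASK] position `mask_pos`."""
    with torch.no_grad():
        logits = model(input_ids=token_ids.unsqueeze(0)).logits[0, mask_pos]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

def attribution_scores(sentence, target_word):
    """Leave-one-out contribution of each context token to the probability
    of `target_word` filling the [MASK] slot in `sentence`."""
    target_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(target_word))[0]
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    mask_pos = (ids == tokenizer.mask_token_id).nonzero()[0].item()
    base = target_log_prob(ids, mask_pos, target_id)
    scores = {}
    for i, tok_id in enumerate(ids):
        if i in (0, len(ids) - 1, mask_pos):   # skip [CLS], [SEP], [MASK]
            continue
        ablated = ids.clone()
        ablated[i] = tokenizer.mask_token_id   # ablate one context token
        token = tokenizer.convert_ids_to_tokens(tok_id.item())
        # Positive score: the token pushes the model toward the target word.
        scores[(i, token)] = base - target_log_prob(ablated, mask_pos, target_id)
    return scores

# Hypothetical usage: which context tokens push the model toward a gendered completion?
print(attribution_scores("The criminal said [MASK] would never apologize.", "she"))
```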
Original language | English |
---|---|
Title of host publication | 38th Pacific Asia Conference on Language, Information and Computation |
Place of Publication | Tokyo, Japan |
Publisher | Association for Computational Linguistics (ACL) |
Number of pages | 10 |
Publication status | Accepted/In press - 15 Oct 2024 |
Event | 38th Pacific Asia Conference on Language, Information and Computation, Tokyo University of Foreign Studies, Tokyo, Japan, 7 Dec 2024 → 9 Dec 2024 (https://sites.google.com/view/paclic38) |
Publication series
Name | Proceedings of the Pacific Asia Conference on Language, Information and Computation |
---|---|
ISSN (Print) | 2012-3736 |
Conference
Conference | 38th Pacific Asia Conference on Language, Information and Computation |
---|---|
Abbreviated title | PACLIC 38 (2024) |
Country/Territory | Japan |
City | Tokyo |
Period | 7/12/24 → 9/12/24 |
Internet address | https://sites.google.com/view/paclic38 |