A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia

Lance Calvin Gamboa*, Mark Lee

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation but fails to tackle the question of bias attribution and explainability. We propose a novel metric, the bias attribution score, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it to multilingual PLMs, including models from Southeast Asia that have not yet been thoroughly examined in the bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping, among other discursive categories, suggesting that these are topics on which PLMs strongly reproduce bias from their pretraining data and where they should be used with greater caution.
Original language: English
Title of host publication: 38th Pacific Asia Conference on Language, Information and Computation
Place of publication: Tokyo, Japan
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 10
Publication status: Accepted/In press - 15 Oct 2024
Event: 38th Pacific Asia Conference on Language, Information and Computation - Tokyo University of Foreign Studies, Tokyo, Japan
Duration: 7 Dec 2024 - 9 Dec 2024
https://sites.google.com/view/paclic38

Publication series

Name: Proceedings of the Pacific Asia Conference on Language, Information and Computation
ISSN (Print): 2012-3736

Conference

Conference: 38th Pacific Asia Conference on Language, Information and Computation
Abbreviated title: PACLIC 38 (2024)
Country/Territory: Japan
City: Tokyo
Period: 7/12/24 - 9/12/24
Internet address: https://sites.google.com/view/paclic38

Bibliographical note

Not yet published as of 19/11/2024.
