Abstract
Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research interest in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained language models (PLMs) handle code-switched (CS) text along three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the abilities of these models to generalise representations to CS corpora. We release all our code and data, including the novel corpus, at https://github.com/francesita/code-mixed-probes.
Original language | English |
---|---|
Title of host publication | Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) |
Editors | Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue |
Publisher | European Language Resources Association (ELRA) |
Pages | 3457–3468 |
Number of pages | 12 |
ISBN (Electronic) | 9782493814104 |
Publication status | Published - 20 May 2024 |
Event | 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy. Duration: 20 May 2024 → 25 May 2024. https://lrec-coling-2024.org/ |
Publication series
Name | ISSN (Print) |
---|---|
International conference on computational linguistics | 2951-2093 |
LREC proceedings | 2522-2686 |
Conference
Conference | 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation |
---|---|
Abbreviated title | LREC-COLING 2024 |
Country/Territory | Italy |
City | Torino |
Period | 20/05/24 → 25/05/24 |
Bibliographical note
Publisher Copyright: © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
Keywords
- code-switching
- multilingualism
- probing language models
ASJC Scopus subject areas
- Theoretical Computer Science
- Computational Theory and Mathematics
- Computer Science Applications
Fingerprint
Dive into the research topics of 'Code-Mixed Probes Show How Pre-Trained Models Generalise on Code-Switched Text'. Together they form a unique fingerprint.
Projects
- 2 Finished
- Baskerville 2.0: Enhanced Provision for High End and On-Demand Users
  Styles, I. (Principal Investigator)
  Engineering & Physical Science Research Council
  4/01/22 → 3/05/22
  Project: Research Councils
- Baskerville: a national accelerated compute resource
  Cai, B. (Co-Investigator) & Morris, A. (Principal Investigator)
  Engineering & Physical Science Research Council, Lenovo UK Limited
  13/10/20 → 31/03/25
  Project: Research Councils