Cost-sensitive BERT for generalisable sentence classification on imbalanced data

Harish Tayyar Madabushi; Elena Kochkina; Michael Castelle

doi:10.18653/v1/D19-5018

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news documents and other forms of decontextualized social communication (e.g. sentiment analysis), inherently deals with data whose categories are simultaneously imbalanced and dissimilar. We show that BERT, while capable of handling imbalanced classes with no additional data augmentation, does not generalise well when the training and test data are sufficiently dissimilar (as is often the case with news sources, whose topics evolve over time). We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT when the training and test sets are dissimilar. We test these methods on the Propaganda Techniques Corpus (PTC) and achieve the second highest score on sentence-level propaganda classification.
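The abstract does not say how the cost-weighting is applied, so the following is only a minimal sketch of one common reading: each example's cross-entropy loss is scaled by a cost attached to its true class, so that errors on the rarer class count for more. The function name `weighted_cross_entropy` and the weights used below are hypothetical and are not taken from the paper.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Mean cross-entropy with each example scaled by the weight of its true class.

    Assumption (not from the paper): costs are fixed per class and the
    result is normalised by the sum of the weights that were applied.
    """
    total = 0.0
    weight_sum = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-softmax for a single example.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        log_prob = row[label] - log_norm
        weight = class_weights[label]
        total += -weight * log_prob
        weight_sum += weight
    return total / weight_sum


# Two sentences with identical scores but different gold labels:
# class 1 (e.g. "propaganda") is given three times the cost of class 0.
logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]
print(weighted_cross_entropy(logits, labels, [1.0, 1.0]))
print(weighted_cross_entropy(logits, labels, [1.0, 3.0]))
```

With equal weights this reduces to ordinary mean cross-entropy; raising the weight of class 1 pulls the loss towards the misclassified class-1 example. In a fine-tuning setup the same idea would be applied to the classifier's output logits, with the weights treated as a hyperparameter.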

Original language	English
Title of host publication	Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Editors	Anna Feldman, Giovanni Da San Martino, Alberto Barron-Cedeno, Chris Brew, Chris Leberknight, Preslav Nakov
Publisher	Association for Computational Linguistics, ACL
Pages	125-134
Number of pages	10
ISBN (Print)	9781950737895
DOIs	https://doi.org/10.18653/v1/D19-5018
Publication status	Published - 4 Nov 2019
Event	Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, Hong Kong. Duration: 4 Nov 2019 → 4 Nov 2019

Conference

Conference	Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Country/Territory	Hong Kong
Period	4/11/19 → 4/11/19

Access to Document

10.18653/v1/D19-5018, Licence: Creative Commons Attribution (CC BY)

https://www.aclweb.org/anthology/D19-5018, Licence: Creative Commons Attribution (CC BY)

Cite this

Tayyar Madabushi, H., Kochkina, E., & Castelle, M. (2019). Cost-sensitive BERT for generalisable sentence classification on imbalanced data. In A. Feldman, G. Da San Martino, A. Barron-Cedeno, C. Brew, C. Leberknight, & P. Nakov (Eds.), Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda (pp. 125-134). Association for Computational Linguistics, ACL. https://doi.org/10.18653/v1/D19-5018

Tayyar Madabushi, Harish ; Kochkina, Elena ; Castelle, Michael. / Cost-sensitive BERT for generalisable sentence classification on imbalanced data. Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda. editor / Anna Feldman ; Giovanni Da San Martino ; Alberto Barron-Cedeno ; Chris Brew ; Chris Leberknight ; Preslav Nakov. Association for Computational Linguistics, ACL, 2019. pp. 125-134

@inproceedings{c124e221c1ea4dd28a8fdcd1c1d3a1d7,
  title = "Cost-sensitive BERT for generalisable sentence classification on imbalanced data",
  abstract = "The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news documents and other forms of decontextualized social communication (e.g. sentiment analysis), inherently deals with data whose categories are simultaneously imbalanced and dissimilar. We show that BERT, while capable of handling imbalanced classes with no additional data augmentation, does not generalise well when the training and test data are sufficiently dissimilar (as is often the case with news sources, whose topics evolve over time). We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT when the training and test sets are dissimilar. We test these methods on the Propaganda Techniques Corpus (PTC) and achieve the second highest score on sentence-level propaganda classification.",
  author = "{Tayyar Madabushi}, Harish and Elena Kochkina and Michael Castelle",
  year = "2019",
  month = nov,
  day = "4",
  doi = "10.18653/v1/D19-5018",
  language = "English",
  isbn = "9781950737895",
  pages = "125--134",
  editor = "Anna Feldman and {Da San Martino}, Giovanni and Alberto Barron-Cedeno and Chris Brew and Chris Leberknight and Preslav Nakov",
  booktitle = "Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda",
  publisher = "Association for Computational Linguistics, ACL",
  note = "Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda ; Conference date: 04-11-2019 Through 04-11-2019",
}

Tayyar Madabushi, H, Kochkina, E & Castelle, M 2019, Cost-sensitive BERT for generalisable sentence classification on imbalanced data. in A Feldman, G Da San Martino, A Barron-Cedeno, C Brew, C Leberknight & P Nakov (eds), Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda. Association for Computational Linguistics, ACL, pp. 125-134, Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, Hong Kong, 4/11/19. https://doi.org/10.18653/v1/D19-5018

Cost-sensitive BERT for generalisable sentence classification on imbalanced data. / Tayyar Madabushi, Harish; Kochkina, Elena; Castelle, Michael.
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda. ed. / Anna Feldman; Giovanni Da San Martino; Alberto Barron-Cedeno; Chris Brew; Chris Leberknight; Preslav Nakov. Association for Computational Linguistics, ACL, 2019. p. 125-134.

TY  - GEN
T1  - Cost-sensitive BERT for generalisable sentence classification on imbalanced data
AU  - Tayyar Madabushi, Harish
AU  - Kochkina, Elena
AU  - Castelle, Michael
PY  - 2019/11/4
Y1  - 2019/11/4
N2  - The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news documents and other forms of decontextualized social communication (e.g. sentiment analysis), inherently deals with data whose categories are simultaneously imbalanced and dissimilar. We show that BERT, while capable of handling imbalanced classes with no additional data augmentation, does not generalise well when the training and test data are sufficiently dissimilar (as is often the case with news sources, whose topics evolve over time). We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT when the training and test sets are dissimilar. We test these methods on the Propaganda Techniques Corpus (PTC) and achieve the second highest score on sentence-level propaganda classification.
AB  - The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news documents and other forms of decontextualized social communication (e.g. sentiment analysis), inherently deals with data whose categories are simultaneously imbalanced and dissimilar. We show that BERT, while capable of handling imbalanced classes with no additional data augmentation, does not generalise well when the training and test data are sufficiently dissimilar (as is often the case with news sources, whose topics evolve over time). We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT when the training and test sets are dissimilar. We test these methods on the Propaganda Techniques Corpus (PTC) and achieve the second highest score on sentence-level propaganda classification.
U2  - 10.18653/v1/D19-5018
DO  - 10.18653/v1/D19-5018
M3  - Conference contribution
SN  - 9781950737895
SP  - 125
EP  - 134
BT  - Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
A2  - Feldman, Anna
A2  - Da San Martino, Giovanni
A2  - Barron-Cedeno, Alberto
A2  - Brew, Chris
A2  - Leberknight, Chris
A2  - Nakov, Preslav
PB  - Association for Computational Linguistics, ACL
T2  - Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Y2  - 4 November 2019 through 4 November 2019
ER  -

Tayyar Madabushi H, Kochkina E, Castelle M. Cost-sensitive BERT for generalisable sentence classification on imbalanced data. In Feldman A, Da San Martino G, Barron-Cedeno A, Brew C, Leberknight C, Nakov P, editors, Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda. Association for Computational Linguistics, ACL. 2019. p. 125-134 doi: 10.18653/v1/D19-5018
