'What is this corpus about?': Using topic modelling to explore a specialised corpus

Akira Murakami; Paul Thompson; Susan Hunston; Dominik Vajn

doi:10.3366/cor.2017.0118

Research output: Contribution to journal › Article › peer-review

Abstract

This paper introduces topic modelling, a machine learning technique that automatically identifies ‘topics’ in a given corpus. The paper illustrates its use in the exploration of a corpus of academic English. It first offers the intuitive explanation of the underlying mechanism of topic modelling and describes the procedure for building a model, including the decisions involved in the model-building process. The paper then explores the model. A topic in topic models is characterised by a set of co-occurring words, and we will demonstrate that such topics bring us rich insights into the nature of a corpus. As exemplary tasks, this paper identifies the prominent topics in different parts of papers, investigates the chronological change of a journal, and reveals different types of papers in the journal. The paper further compares topic modelling to two more traditional techniques in corpus linguistics, semantic annotation and keywords analysis, and highlights the strengths of topic modelling. We believe that topic modelling is particularly useful in the initial exploration of a corpus.

Original language	English
Pages (from-to)	243-277
Journal	Corpora
Volume	12
Issue number	2
Early online date	1 Aug 2017
DOIs	https://doi.org/10.3366/cor.2017.0118
Publication status	E-pub ahead of print - 1 Aug 2017

Keywords

academic discourse
machine learning
semantic analysis
topic model
word co-occurrence

Access to Document

10.3366/cor.2017.0118. Licence: Creative Commons: Attribution (CC BY)

murakamia2017whatis. Final published version, 1.03 MB. Licence: Creative Commons: Attribution (CC BY)

Cite this

@article{c971dbe297a4477cbccd2ac9e5f1da83,

title = "'What is this corpus about?': Using topic modelling to explore a specialised corpus",

abstract = "This paper introduces topic modelling, a machine learning technique that automatically identifies {\textquoteleft}topics{\textquoteright} in a given corpus. The paper illustrates its use in the exploration of a corpus of academic English. It first offers the intuitive explanation of the underlying mechanism of topic modelling and describes the procedure for building a model, including the decisions involved in the model-building process. The paper then explores the model. A topic in topic models is characterised by a set of co-occurring words, and we will demonstrate that such topics bring us rich insights into the nature of a corpus. As exemplary tasks, this paper identifies the prominent topics in different parts of papers, investigates the chronological change of a journal, and reveals different types of papers in the journal. The paper further compares topic modelling to two more traditional techniques in corpus linguistics, semantic annotation and keywords analysis, and highlights the strengths of topic modelling. We believe that topic modelling is particularly useful in the initial exploration of a corpus.",

keywords = "academic discourse, machine learning, semantic analysis, topic model, word co-occurrence",

author = "Akira Murakami and Paul Thompson and Susan Hunston and Dominik Vajn",

year = "2017",

month = aug,

day = "1",

doi = "10.3366/cor.2017.0118",

language = "English",

volume = "12",

pages = "243--277",

journal = "Corpora",

issn = "1749-5032",

publisher = "Edinburgh University Press",

number = "2",

}

TY - JOUR

T1 - 'What is this corpus about?'

T2 - Using topic modelling to explore a specialised corpus

AU - Murakami, Akira

AU - Thompson, Paul

AU - Hunston, Susan

AU - Vajn, Dominik

PY - 2017/8/1

Y1 - 2017/8/1

AB - This paper introduces topic modelling, a machine learning technique that automatically identifies ‘topics’ in a given corpus. The paper illustrates its use in the exploration of a corpus of academic English. It first offers the intuitive explanation of the underlying mechanism of topic modelling and describes the procedure for building a model, including the decisions involved in the model-building process. The paper then explores the model. A topic in topic models is characterised by a set of co-occurring words, and we will demonstrate that such topics bring us rich insights into the nature of a corpus. As exemplary tasks, this paper identifies the prominent topics in different parts of papers, investigates the chronological change of a journal, and reveals different types of papers in the journal. The paper further compares topic modelling to two more traditional techniques in corpus linguistics, semantic annotation and keywords analysis, and highlights the strengths of topic modelling. We believe that topic modelling is particularly useful in the initial exploration of a corpus.

KW - academic discourse

KW - machine learning

KW - semantic analysis

KW - topic model

KW - word co-occurrence

U2 - 10.3366/cor.2017.0118

DO - 10.3366/cor.2017.0118

M3 - Article

SN - 1749-5032

VL - 12

SP - 243

EP - 277

JO - Corpora

JF - Corpora

IS - 2

ER -
