'What is this corpus about?': Using topic modeling to explore a specialised corpus
Research output: Contribution to journal › Article › peer-review
- University of Central Lancashire
The present paper introduces topic modeling, a machine learning technique that automatically identifies ‘topics’ in a given corpus, and illustrates its use in the exploration of an academic English corpus. It first offers an intuitive explanation of the underlying mechanism of topic modeling and describes the procedure for building a model, including the decisions involved in the model-building process. The paper then explores the model. A topic in a topic model is characterized by a set of co-occurring words, and we demonstrate that such topics provide rich insights into the nature of a corpus. As exemplary tasks, the paper identifies the prominent topics in different parts of papers, investigates chronological change in a journal, and reveals different types of papers in the journal. The paper further compares topic modeling to two more traditional techniques in corpus linguistics, semantic annotation and keyword analysis, and highlights the strengths of topic modeling. We believe that topic modeling is particularly useful in the initial exploration of a corpus.
Early online date: 1 Aug 2017
Publication status: E-pub ahead of print - 1 Aug 2017
- Keywords: academic discourse, machine learning, semantic analysis, topic model, word co-occurrence
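
To make concrete the abstract's statement that a topic is characterized by a set of co-occurring words, the following is a minimal sketch of one common topic-modeling approach, latent Dirichlet allocation fitted by collapsed Gibbs sampling. The abstract does not name the algorithm used in the paper, so the choice of LDA is an assumption, as are the toy corpus, the hyperparameter values, and the function names `fit_lda` and `top_words`, all of which are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenised documents (illustrative sketch)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # doc_topic[d][k]: number of tokens in document d assigned to topic k
    doc_topic = [[0] * num_topics for _ in docs]
    # topic_word[k][w]: number of times word w is assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much this document uses it and
                # how strongly the topic is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical toy corpus: two documents about corpus methods, two about modeling.
docs = [
    "corpus keyword frequency corpus annotation".split(),
    "keyword annotation corpus frequency".split(),
    "model sampling topic model inference".split(),
    "topic inference model sampling".split(),
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {words}")
```

Each topic is reported as its most frequent words, and `doc_topic` gives the per-document topic counts that exploratory tasks like those listed in the abstract would draw on. A real study would more likely rely on an established toolkit and a much larger corpus than on a hand-written sampler.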