Developing and validating a mid-frequency word list for chemistry: a corpus-based approach using big data

Ismail Xodabande*, Mahmood Reza Atai, Mohammad R. Hashemi, Paul Thompson

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Given the importance of specialized vocabulary in scientific communication and academic discourse, there is a growing need for word lists that address the vocabulary-learning needs of university students and researchers in different subject areas. The current study analyzed a corpus of chemistry research articles (278 million running words) to establish a mid-frequency vocabulary list for this field. Using frequency, range, and dispersion criteria, the study identified 560 lemmas in the fourth to ninth British National Corpus/Corpus of Contemporary American English (BNC/COCA) lists that provided 6.4% coverage of all words in the corpus. The list was validated against specialized and general corpora, and the results confirmed the value and relevance of the items for chemistry. To support pedagogical use, the vocabulary items were divided into five bands based on their coverage and importance. The 100 words in the first band were the most important mid-frequency vocabulary in chemistry, providing 3.05% coverage. The study highlights the significant contribution of mid-frequency words in research articles, and the findings have implications for using large corpora as a big data source in identifying specialized and field-specific vocabulary.
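
The coverage figures quoted in the abstract are the share of running words in a corpus that belong to a word list. The sketch below illustrates that calculation and a simple frequency-ranked banding step on a toy sentence. It is not the authors' procedure: it works on raw tokens rather than lemmas, it ignores the range and dispersion criteria, and the function names `coverage` and `split_into_bands`, the fixed `band_size`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def coverage(tokens, word_list):
    """Return the percentage of running words accounted for by word_list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for words that never occur in the corpus.
    covered = sum(counts[word] for word in set(word_list))
    return 100.0 * covered / total


def split_into_bands(tokens, word_list, band_size):
    """Rank list items by corpus frequency and cut them into bands.

    Assumption: equal-sized bands ordered by raw frequency; ties are
    broken alphabetically so the result is deterministic.
    """
    counts = Counter(tokens)
    ranked = sorted(set(word_list), key=lambda w: (-counts[w], w))
    return [ranked[i:i + band_size] for i in range(0, len(ranked), band_size)]


# Toy corpus (hypothetical data, 10 running words).
toy_tokens = (
    "the catalyst and the solvent react while the catalyst dissolves".split()
)
toy_list = ["catalyst", "solvent", "dissolves"]

toy_coverage = coverage(toy_tokens, toy_list)
toy_bands = split_into_bands(toy_tokens, toy_list, band_size=2)
```

On the toy data the three list items account for 4 of the 10 running words. With a real corpus, the same ratio computed per band would show how much of the total coverage the first band contributes.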
Original language: English
Article number: 32
Number of pages: 21
Journal: Asian-Pacific Journal of Second and Foreign Language Education
Volume: 8
Issue number: 1
Publication status: Published - 1 Oct 2023

Keywords

  • Corpus linguistics
  • Academic vocabulary
  • EAP/ESP vocabulary
  • Big data
  • Wordlist
  • Chemistry
  • Mid-frequency vocabulary
  • Research article
