A corpus-based developmental investigation of lexical richness and syntactic complexity in children's written stories

Ya-Ling Hsiao, Nicola J. Dawson, Nilanjana Banerji, Kate Nation

    Research output: Working paper/Preprint

    Abstract

    We analysed narrative writing development using a large corpus of short stories (N > 100,000) written by children aged 5-13 in the UK. Linguistic complexity was assessed using both lexical (N = 30) and syntactic (N = 14) measures. Most measures were associated with age: older children's writing showed greater lexical density, sophistication, and diversity than younger children's writing. Older children also used longer sentences, T-units, and clauses, and the density of smaller syntactic units inside larger units was higher in their writing. Principal Component Analysis identified a number of dimensions associated with complexity, with the first two dimensions capturing nearly 50% of variance. Lexical diversity was mainly represented on the first dimension and syntactic complexity on the second. Across all age categories, there was wide variation in syntactic complexity, suggesting that the ability to construct complex sentences may be less uniform across children of different ages than the ability to use a diverse set of lexical items. We discuss the utility of analysing children's writing development using a computational, data-driven approach.
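As a rough illustration of the kind of measures the abstract refers to, the sketch below computes one simple lexical diversity index (type-token ratio) and one simple syntactic index (mean sentence length) for a short text. These two indices, the crude regex tokenizer, and the function names are assumptions chosen for illustration; the abstract does not list the 30 lexical and 14 syntactic measures actually used, and this is not the authors' pipeline.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Distinct word types divided by total tokens (illustrative diversity index)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    """Average number of word tokens per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


# Hypothetical two-sentence story, not taken from the corpus.
story = "The cat sat. The cat ran away!"
print(type_token_ratio(story))      # 5 distinct types over 7 tokens
print(mean_sentence_length(story))  # sentences of 3 and 4 tokens
```

Raw type-token ratio is sensitive to text length, so a comparison across stories of different lengths would normally call for a length-corrected index instead.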
    Original language: English
    Publisher: SSRN
    Number of pages: 42
    Publication status: Published - 23 Aug 2022
