Understanding US regional linguistic variation with Twitter data analysis

Yuan Huang; Diansheng Guo; Jack Grieve; Alice Kasakoff

English, Drama and Creative Studies

Research output: Contribution to journal › Article › peer-review

Abstract

We analyze a big data set of geo-tagged tweets collected over one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation at fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variation in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
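The first steps of the pipeline described in the abstract (per-user alternation frequencies, then county-level aggregation and smoothing) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `soda`/`pop` alternation, the county labels, the toy token lists, and the simple neighbor-averaging used as a stand-in for the authors' smoothing method. The PCA and regionalization steps are not shown.

```python
from collections import defaultdict

# Hypothetical alternation: two variants that express the same meaning.
ALTERNATIONS = {"soda_vs_pop": ("soda", "pop")}


def user_proportions(tokens):
    """Per alternation, the share of the first variant among a user's uses of either variant."""
    result = {}
    for name, (first, second) in ALTERNATIONS.items():
        n_first = tokens.count(first)
        n_second = tokens.count(second)
        total = n_first + n_second
        if total:  # skip alternations this user never used
            result[name] = n_first / total
    return result


def county_means(users, name):
    """Average one alternation's user-level proportions within each county."""
    values = defaultdict(list)
    for county, tokens in users:
        props = user_proportions(tokens)
        if name in props:
            values[county].append(props[name])
    return {county: sum(v) / len(v) for county, v in values.items()}


def smooth(means, neighbors):
    """Replace each county value with the mean of itself and its neighbors that have data.

    This neighbor average is an assumed stand-in, not the paper's smoothing method.
    """
    smoothed = {}
    for county, value in means.items():
        pool = [value] + [means[n] for n in neighbors.get(county, []) if n in means]
        smoothed[county] = sum(pool) / len(pool)
    return smoothed


# Toy data: (county, tokens from one user's tweets).
users = [
    ("A", ["soda", "soda", "pop", "the"]),
    ("A", ["pop", "pop"]),
    ("B", ["soda"]),
]
means = county_means(users, "soda_vs_pop")
smoothed = smooth(means, {"A": ["B"], "B": ["A"]})
```

Averaging user-level proportions, rather than pooling raw counts, keeps a single prolific user from dominating a county's value. Whether the authors made the same choice is not stated in the abstract.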

Original language	English
Pages (from-to)	244-255
Journal	Computers, Environment and Urban Systems
Volume	59
Early online date	31 Dec 2015
Publication status	Published - Sept 2016

Keywords

US regions
Spatial data mining
Social media
Linguistic
Twitter
American dialects
Regionalization

Access to Document

https://doi.org/10.1016/j.compenvurbsys.2015.12.003
Final published version, 3.34 MB
Licence: Creative Commons: Attribution-NonCommercial-NoDerivs (CC BY-NC-ND)

Cite this

@article{110dc07ee9534a3ebf99729cc7b7b40f,
title = "Understanding US regional linguistic variation with Twitter data analysis",
abstract = "We analyze a big data set of geo-tagged tweets collected over one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation at fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variation in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.",
keywords = "US regions, Spatial data mining, Social media, Linguistic, Twitter, American dialects, Regionalization",
author = "Yuan Huang and Diansheng Guo and Jack Grieve and Alice Kasakoff",
year = "2016",
month = sep,
language = "English",
volume = "59",
pages = "244--255",
journal = "Computers, Environment and Urban Systems",
issn = "0198-9715",
publisher = "Elsevier",
}
