The aim of the present study is to establish criteria for the optimal size of a corpus that can provide stable conditional probabilities of morphological and/or syntagmatic types. The optimality of corpus size is defined in terms of the smallest sample that generates probability distribution equal to distribution derived from the large sample that generates stable probabilities. The latter distribution we refer to as “target distribution”. In order to establish the above criteria we varied the sample size, the word sequence size (bigrams and trigrams), sampling procedure (randomly chosen words and continuous text) and position of the target word in a sequence. The obtained distributions of conditional probabilities derived from smaller samples have been correlated with target distributions. Sample size at which probability distribution reaches maximal correlation (r=1) with the target distribution was taken as being optimal. The research was done on Corpus of Serbian language. In case of bigrams the optimal sample size for random word selection is 65.000 words, and 281.000 words for trigrams. In contrast, continuous text sampling requires much larger samples to reach stability: 810.000 words for bigrams and 868.000 words for trigrams. The factors that caused these differences remain unclear and need additional empirical investigation.
|Publication status||Published - 2009|