Studying language typology by means of corpora


Lumme Erilt

University of Tartu

Department of English


  1. Introduction


In the present paper I am going to discuss some aspects of a study in progress, namely my Master’s thesis ‘A quantitative-typological study of Old English’, which I am currently writing for the English Department at the University of Tartu. To begin with, I shall refer to the theoretical framework of the study and try to explain why I became interested in diachronic typology. Secondly, I shall describe the primary source of the study, the Helsinki Corpus, and more specifically its Old English part. Then I shall discuss some problems that I encountered in preparing the texts in the corpus for the computation of word frequencies. Finally, I shall point to possible ways of analysing the data for the benefit of linguistic typology.


  2. Theoretical framework


Borrowing a thought from Lyons (Lyons 1981:10), the aim of linguistics is to describe language competence as opposed to language performance, i.e. not all the actual sentences of a language, but all the potential ones as determined by general rules.


It seems to me that the same terminological opposition might be applied to the description of two theories of language universals that arose in the mid-1960s, namely Chomskyan Generative Grammar (Chomsky 1965) and Greenbergian Linguistic Typology (Greenberg 1966). Generative Grammar is interested in language competence in a fairly straightforward manner -- relying on introspection and the speaker’s intuition, it is subjective and not empirical in the sense that it does not build on large amounts of heterogeneous data but rather on a few languages or standard dialects. Linguistic Typology is first and foremost interested in language performance, and is based on the empirical description and comparison of a great number of languages. This theory assumes that only by generalising over large amounts of variable data is it possible to get an idea of language competence and language universals. Recent developments in corpus and computational linguistics have made the systematic analysis of large bodies of text samples easier and faster, thus enabling not only intra-lingual analysis but also inter-lingual or cross-linguistic comparison.


One of the assumptions of universalist research is the belief that all the languages of the world, ancient or modern, are intrinsically similar and exhibit a similar complexity of system that, although different on the surface, goes back to a similar deep structure. Following this principle we can compare on a par Modern High German and Modern English, or Modern High German and Old or Middle High German, as all of these are considered independent language stages which reveal the same basic underlying structure.


This theoretical framework forms the background of my study of Old English. Old English, also known as Anglo-Saxon, is the term used to refer to the language spoken in the British Isles from the 5th to the 12th century by three Germanic tribes -- Angles, Saxons and Jutes. Although the language was not entirely homogeneous, the dialectal differences were not so great as to hinder mutual understanding.


  3. Sources


As a primary source I have used the Old English part of the Helsinki Corpus of English Texts (Kytö 1991). This corpus contains texts from the Old, Middle and Early Modern English periods and early texts from the Scottish and American varieties of English. Recently, the Middle English part of the corpus was syntactically parsed and is now available under the title of Penn-Helsinki Parsed Corpus. The morphological and syntactic analysis of the Old English part is currently being carried out in Amsterdam and York and will result, some time in the future, in the Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus. It must be stressed, though, that the tagging and parsing of diachronic corpora is very laborious and time-consuming, mainly because of the lack of standard spelling, and so most of the work is done practically manually.


The Old English period of the Helsinki Corpus consists of four sub-periods:

OE1 (up to year 850);

OE2 (yrs. 850-950);

OE3 (yrs. 950-1050);

OE4 (yrs. 1050-1150).

At the initial stage of the study I intended to use all the texts from all four periods, but I ended up with the periods OE2 and OE4, 92,050 words and 67,380 words respectively. Period OE3 was discarded because it comprised a great deal of poetry with significantly different syntax and lexical content, and period OE1 because of its shortness (2,190 words only), though a pilot study was made on it. The time gap between the periods OE2 and OE4 provided a suitable diachronic perspective and could reveal typological change, if there was any.

The texts from the two sub-periods were kept apart, while no distinction whatsoever was made between different text types and categories. One of the reasons for this decision was that, with the aim of giving a representative picture of Old English, the compilers of the corpus have tried to include a proportionate representation of different text genres, i.e. they have tried to diminish the proportion of religious and historical texts, which prevail among the texts that have been preserved from those times. The other reason for not distinguishing between the text types was a somewhat naive hope of getting an objective picture of the language as a whole. On the basis of these texts frequency lists were made up. At this point some important problems needed to be solved.


4. De-coding


Firstly, the texts in the Helsinki Corpus are preceded by 24 reference codes in COCOA format which give the textual parameters of texts or text groups (Kytö 1991:42). The codes are given in angle brackets. These codes had to be removed in order to get pure frequency lists, i.e. lists not containing the reference codes.


Secondly, inside the texts, so-called 'text level' codes had been added by the compilers of the corpus (Kytö 1991:28 ff.). These codes were designed to mark text in a foreign language, runes, emendations, editor's comments, compilers' comments, fonts other than the basic font (e.g. italics) and headings. Foreign language text mostly consisted of parallel text or translation from Latin. These codes and the comments inside them were likewise excluded for the present study. The exclusion of emendations, though, is somewhat problematic, because these might have included essential vocabulary items.


For these purposes a UNIX shell script was devised. The script also removed punctuation marks and transformed capital letters into lower-case ones, so that words like






all meaning 'the Lord' were from now onwards considered as one and the same word by the computer.
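
The clean-up steps described in this section can be sketched as follows. This is a minimal illustration in Python, not the shell script used in the study. It assumes that the reference codes are anything enclosed in angle brackets, that the text is available in Unicode, and that a text-level code is marked by the delimiter pair `[^ ... ^]`, which stands in here as a placeholder for the actual markers described in Kytö (1991). The function name `clean` and the sample words are likewise illustrative only.

```python
import re

# Assumption: a reference code is anything enclosed in angle brackets.
ANGLE_CODE = re.compile(r"<[^>]*>")
# Assumption: placeholder delimiters for one kind of text-level code.
TEXT_LEVEL_CODE = re.compile(r"\[\^.*?\^\]", re.DOTALL)


def clean(text: str) -> list[str]:
    """Return lower-case word tokens with codes and punctuation removed."""
    text = ANGLE_CODE.sub(" ", text)       # drop reference codes
    text = TEXT_LEVEL_CODE.sub(" ", text)  # drop text-level codes and their content
    # Replace every character that is neither a letter nor whitespace by a space.
    text = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    return text.lower().split()


sample = "<B SAMPLE>\nDrihten, drihten. DRIHTEN [^a comment^] þæt"
print(clean(sample))
```

After this step, differently capitalised or punctuated occurrences of a word collapse into a single type, which is what the frequency lists require.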


  5. Normalisation


In addition to the extra-linguistic problems that have been mentioned so far, intra-linguistic irregularities appeared, mainly caused by the lack of a spelling standard in Old English and by the various dialects. It is well known that most of the manuscripts that have been preserved from the Old English period are in the West-Saxon dialect (as is visible, for example, in the Toronto Corpus, which contains all the existing Old English texts; v. Healey & Venezky 1980). Yet one of the chief aims of the compilers of the Helsinki Corpus has been to include as many texts from different Old English dialects as possible and thus to get a more objective picture of the dialectal variation in Anglo-Saxon England. For the purpose of the present study, though, dialectal forms had to be standardised to a single form, and spelling irregularities, arising from various scribal traditions and in many cases from the lack of any tradition, had to be normalised.


"In essence the process of normalisation," as Raymond Hickey (1994:169) argues, "consists of replacing variants of a grammatical form by a single form by external consensus, e.g. as the latter is the input to a later standard form [...]". Hickey also warns of "almost ideological dislike of normalisation, particularly on the part of medieval scholars" (ibid.), but in spite of that, he acknowledges, there are some obvious advantages. Normalisation in this particular case enabled me to find out frequency information about different word forms, i.e. not all possible spellings. So, for example, words like





meaning 'woman', had to be considered as one and the same word, or,







all meaning ‘so’ had to be standardised to one form swa.


Normalisation in this case did not mean modernisation (e.g. to Modern English), but rather standardisation. The standard or 'norm' was taken to be the West-Saxon dialect of English. In practice the normalisation procedure mostly relied on the forms given in A Concise Anglo-Saxon Dictionary (Clark Hall 1960) or, where a form was not included there, on An Anglo-Saxon Dictionary (Bosworth & Toller 1898). In some rare cases, such as the paradigms of pronouns, the forms were normalised according to the Old English Grammar (Campbell 1959). As all these sources have adopted the West-Saxon standard, the normalisation-standardisation process meant in practice a "translation" of the texts into West-Saxon.


Besides these, the following normalisations should be mentioned:


The normalisation was in essence semi-automatic or computer-assisted because, for example, automatically changing every occurrence of the string mon would also have changed words like cumon or monandæg. It was often important to check the context of each word separately to find out the 'type' to which a variant should be normalised.
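
The difference between blind string replacement and token-based, computer-assisted normalisation can be sketched as below. This is only an illustration in Python, not the procedure actually used. The mapping of mon to man is an assumed target form chosen for the example, and the names `naive_replace`, `normalise_tokens` and `contexts` are hypothetical.

```python
# Assumption: an illustrative variant -> normalised form table; the real table
# would be compiled from the dictionaries and the grammar named above.
NORMALISATIONS = {"mon": "man"}


def naive_replace(text: str, variant: str, norm: str) -> str:
    """Blind substring replacement: also changes longer words containing the variant."""
    return text.replace(variant, norm)


def normalise_tokens(tokens: list[str], table: dict[str, str] = NORMALISATIONS) -> list[str]:
    """Replace a token only when the whole token matches a listed variant."""
    return [table.get(tok, tok) for tok in tokens]


def contexts(tokens: list[str], variant: str, width: int = 2) -> list[tuple[int, list[str]]]:
    """List each occurrence of a variant with its neighbours, for manual checking."""
    return [
        (i, tokens[max(0, i - width): i + width + 1])
        for i, tok in enumerate(tokens)
        if tok == variant
    ]


tokens = ["se", "mon", "cumon", "monandæg"]
print(naive_replace(" ".join(tokens), "mon", "man"))  # damages cumon and monandæg
print(normalise_tokens(tokens))                       # changes only the token mon
print(contexts(tokens, "mon", width=1))               # occurrences to inspect by hand
```

Whole-token matching avoids the damage done by substring replacement, while the context listing supports the manual decision about which 'type' an ambiguous variant belongs to.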


Due to these normalisations and changes the total length of the texts changed as well. Instead of the original 92,050 tokens in OE2, the number of words was now 91,044 (the big difference between the two figures is due to the large proportion of parallel Latin text that was excluded from the Vespasian Psalter), and for OE4 the figure changed from the previous 67,380 to 67,206 tokens.


On the basis of these modified texts, frequency lists of word forms were created. It is to be hoped that the morphological annotation prepared for the Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English (see Pintzuk & Taylor, forthcoming) will make a similar analysis possible for the frequencies of lemmas as well.
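
A frequency list of word forms of the kind mentioned here could be produced along the following lines. The sketch is in Python and assumes that the cleaned and normalised text is already available as a list of tokens; the function name `frequency_list` and the sample tokens are illustrative.

```python
from collections import Counter


def frequency_list(tokens: list[str]) -> list[tuple[str, int, float]]:
    """Return (word form, absolute frequency, relative frequency) rows,
    sorted by descending frequency and then alphabetically."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(form, n, n / total) for form, n in rows]


for form, n, rel in frequency_list(["swa", "and", "swa", "he"]):
    print(f"{form}\t{n}\t{rel:.3f}")
```

Because the rows count word forms and not lemmas, inflected forms of one lexeme appear as separate entries, which is the limitation that the morphological annotation is hoped to remove.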


  6. Analysis


The frequency lists of Old English were made for two purposes, both contributing to the study of linguistic typology and universals.

First, I was interested in relative frequencies as possible indicators of morphological type. This line of thought follows the studies made by Tuldava (1977, 1995) and Bektaev (1978) and is based on the idea that the morphological type of a language is expressed by several quantitative parameters: the contrast between type and token frequencies and the amount of text that the types cover. These parameters should thus also help to determine the morphological type of a particular language. The diachronic aspect of the study lay in detecting possible typological change from an earlier period of the language (OE2) to a later one (OE4).
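
Two simple quantities of the kind referred to here, the type/token ratio and the share of the text covered by the most frequent types, can be computed as in the sketch below. It is written in Python under the assumption that the text is available as a list of tokens. These are generic measures chosen for illustration and are not necessarily the exact parameters used by Tuldava or Bektaev; the function names are hypothetical.

```python
from collections import Counter


def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct word forms divided by the number of running words."""
    return len(set(tokens)) / len(tokens)


def coverage(tokens: list[str], top_n: int) -> float:
    """Share of the running text covered by the top_n most frequent word forms."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / len(tokens)


# Toy sample: 10 tokens, 4 types.
tokens = ["and"] * 5 + ["he"] * 3 + ["swa", "man"]
print(type_token_ratio(tokens))  # 4 types / 10 tokens
print(coverage(tokens, 2))       # the two most frequent types cover 8 of 10 tokens
```

Since the type/token ratio depends on text length, a comparison between OE2 and OE4 would have to be made on samples of equal size.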

My second aim was to apply the relative frequencies to the theory of markedness and to study markedness hierarchies by contrasting the relative frequencies of the members of some linguistic categories. This idea is based on the observation that grammatically and semantically unmarked words appear more often than their marked counterparts (see e.g. Croft 1990, Halliday 1991).

The discussion of the analysis has to be left for another time.



