Morphologically disambiguated corpus

This corpus consists of manually disambiguated files. Every text was manually disambiguated by two people; a third person then compared their results and made the necessary corrections.


Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997), during which G. Orwell's novel "1984" was disambiguated. The main part of this corpus, 400 000 words, was disambiguated in 2002-2003. This work was supported by the national program "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture). The following researchers participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muischnek, Heili Orav, Andriela Rääbis, Kadri Vider.

The texts belong to the following text classes:

Text class                                         Number of words
Fiction (Estonian authors)                                 104 000
G. Orwell's "1984"                                          75 500
Newspaper texts                                            111 000
Legal texts                                                121 000
Texts from the scientific magazine "Horisont"               98 000
Reference texts                                              4 000
Altogether                                                 513 000

File names

File names begin with a three-letter code, ilu (fiction), sea (legal texts), aja (newspaper), hor (Horisont), inf (reference texts), or with 1984 for Orwell's novel.
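The prefix convention can be turned into a small lookup. The following Python sketch is only an illustration: the function name text_class and the "unknown" fallback are assumptions, and only the prefixes listed above are covered.

```python
# Prefix-to-class table copied from the list of file-name codes above.
TEXT_CLASSES = {
    "ilu": "fiction",
    "sea": "legal texts",
    "aja": "newspaper",
    "hor": "Horisont",
    "inf": "reference texts",
}


def text_class(filename: str) -> str:
    """Return the text class implied by the start of a corpus file name.

    Hypothetical helper: names that match no listed prefix yield "unknown".
    """
    # "1984" is the one code that is not three letters long.
    if filename.startswith("1984"):
        return "1984"
    return TEXT_CLASSES.get(filename[:3], "unknown")
```

For instance, text_class("inf_0002.yhene") gives "reference texts".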

The origin of the texts

All the fiction texts, except for "1984", come from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990. The number in the filename is the same as in the original; the code "tkt" or "stkt" in the original filename has been replaced with the code "ilu".

The newspaper files are not present in the other corpora. The filename contains the name of the newspaper.

The reference texts come from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990; the file inf_0002.yhene is from the text class "Hobbies" and the file inf_0011.yhene is from the text class "Encyclopaedias".

The legal documents come from 1) the homepage of the Estonian Legal Language Centre (April 2002) and 2) some other sources. The names of the files obtained from the Estonian Legal Language Centre contain the same number as their source files. The names of the files from other sources contain the name of the legislative document.

The excerpts from the magazine "Horisont" were taken from its homepage on 9 October 2003 and date from the years 1996-2003. The filenames are the same as they were on the homepage of "Horisont".

The analysis

The wordforms have been analysed one by one, except for some multi-word proper names like New York. The result of the analysis for one wordform is as follows:

    lemma+ending // morphological categories //
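As a rough illustration of the format above, the following Python sketch splits one analysis string into lemma, ending and category list. The function name parse_analysis is an assumption, and so are the category labels in the usage example; the sketch also assumes that the lemma and ending never contain "//" and that categories are separated by whitespace. It does not handle any extra marking used for compounds or derived words.

```python
def parse_analysis(analysis: str):
    """Split 'lemma+ending // categories //' into (lemma, ending, categories).

    Hypothetical helper; assumes '//' occurs only as the field delimiter.
    """
    parts = analysis.strip().split("//")
    if len(parts) < 3:
        raise ValueError(f"not in 'lemma+ending // categories //' form: {analysis!r}")
    head = parts[0].strip()
    # Split on the last '+', so a '+' inside the lemma part stays in the lemma.
    lemma, sep, ending = head.rpartition("+")
    if not sep:
        # No '+' at all: treat the whole head as the lemma with an empty ending.
        lemma, ending = head, ""
    categories = parts[1].split()
    return lemma, ending, categories
```

With invented category labels, parse_analysis("maja+s // S sg in //") gives ("maja", "s", ["S", "sg", "in"]).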

If the word-form is a compound or a derived word, then:

The tags <s> and </s> placed on separate lines mark the beginning and end of a sentence, heading etc. Some files also contain paragraph tags <p> and </p>.
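Because the sentence tags sit on their own lines, a file can be grouped into sentences with a simple line scan. The Python sketch below is a minimal illustration under that assumption; the function name sentences is hypothetical, and the sketch skips paragraph tags and silently drops any line that falls outside an <s> ... </s> pair.

```python
def sentences(lines):
    """Yield each sentence as a list of the lines between <s> and </s>."""
    current = None  # None means we are outside any sentence
    for raw in lines:
        line = raw.strip()
        if line == "<s>":
            current = []
        elif line == "</s>":
            if current is not None:
                yield current
            current = None
        elif line in ("<p>", "</p>") or not line:
            # Paragraph tags and blank lines carry no token.
            continue
        elif current is not None:
            current.append(line)
```

Typical use would be to iterate over sentences(open(path, encoding="utf-8")), where the encoding is also an assumption about the files.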

Symbols and entities

In addition to letters and numbers the following symbols can be found in this corpus: ,;.:<>()!?%&"'*+-/=@_~

The non-ASCII characters are represented as SGML entities. All the entities used are listed in the table of entities.

A dash can appear as - or as --, and its annotation is always &mdash;. At the beginning of a list the combination -. can occur; it is likewise annotated as &mdash;.

The quotation marks can be in the following forms:

" double quote (beginning or end)
' single quote (beginning or end)
&ldquo; beginning double quote
&rdquo; end double quote
&lsquo; beginning single quote
&rsquo; end single quote
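For reading the texts as Unicode, the entities can be decoded with Python's standard html.unescape, as sketched below. This assumes that every entity in the corpus's table of entities is also a standard HTML named entity, which has not been checked here. The names decode_entities and QUOTE_ROLES are hypothetical; the role table simply restates the list of quotation mark forms above.

```python
import html

# Quotation mark forms from the list above: (kind, position in the quotation).
QUOTE_ROLES = {
    '"': ("double", "beginning or end"),
    "'": ("single", "beginning or end"),
    "&ldquo;": ("double", "beginning"),
    "&rdquo;": ("double", "end"),
    "&lsquo;": ("single", "beginning"),
    "&rsquo;": ("single", "end"),
}


def decode_entities(text: str) -> str:
    """Replace named entities such as &otilde; or &mdash; with Unicode characters."""
    return html.unescape(text)
```

Decoding is best applied after any processing that relies on the entity form, since directional information such as &ldquo; versus &rdquo; is easier to match before conversion.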

Known problems so far

About 0.3% of the analyses may be debatable or wrong.

Some publications about this

  1. H.-J. Kaalep, K. Muischnek, K. Müürisep, A. Rääbis, K. Habicht. Kas tegelik tekst allub eesti keele morfoloogilistele kirjeldustele? Eesti kirjakeele testkorpuse morfosüntaktilise märgendamise kogemusest. Keel ja Kirjandus 9/2000, pp. 623-633.
  2. K. Muischnek, K. Vider. Sõnaliigituse kitsaskohad eesti keele arvutianalüüsis. Submitted for publication in the proceedings of the 2004 Applied Linguistics Conference (Rakenduslingvistika konverents).

Last modified: December 21 2018 22:05:41.