
eVikings II (Establishment of the Virtual Centre of Excellence for IST RTD in Estonia)

FP5 IST accompanying measures project IST-2001-37592

Work Package 3: Supporting RTD in language technologies

Project: Morphologically tagged and disambiguated text corpus

University of Tartu

Report

Description:

Correctly morphologically disambiguated corpora are needed for:

the input of a syntactic analyser;
the input of a semantic analyser;
the development of an automatic morphological disambiguator;
compiling frequency dictionaries;
linguistic research.

Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997), during which G. Orwell's novel "1984" was disambiguated. The main part of this corpus, 400 000 words, was disambiguated in 2002-2003. This work was supported by the eVikings II project (100 000 words) and the national programme "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture) (300 000 words).

The method of morphological disambiguation was as follows:

first, the texts were processed with the morphological analyser ESTMORF (developed by Filosoft Ltd, https://www.filosoft.ee);
every text was then disambiguated manually by two people, and a third person compared the results and made the necessary corrections.
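
The comparison step can be illustrated with a minimal sketch. The report does not say how the third person compared the two manual versions, so the function below is only an assumption of one possible approach: it lists the token positions where two annotators chose different analyses. The name find_disagreements and the (word form, analysis) pair representation are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: one (word form, chosen analysis) pair per token.
Annotation = Tuple[str, str]


def find_disagreements(first: List[Annotation], second: List[Annotation]) -> List[int]:
    """Return the token positions where two annotators chose different analyses."""
    if len(first) != len(second):
        raise ValueError("the two annotations cover a different number of tokens")
    disagreements = []
    for i, ((tok_a, ana_a), (tok_b, ana_b)) in enumerate(zip(first, second)):
        if tok_a != tok_b:
            raise ValueError(f"token mismatch at position {i}: {tok_a!r} vs {tok_b!r}")
        if ana_a != ana_b:
            # Only these positions would need a decision from the third person.
            disagreements.append(i)
    return disagreements


# Illustrative data only: the second annotator's reading of "talu" is made up.
annotator_1 = [("Ühe", "üks+0 //_P_ indef sg gen //"), ("talu", "talu+0 //_S_ com sg gen //")]
annotator_2 = [("Ühe", "üks+0 //_P_ indef sg gen //"), ("talu", "talu+0 //_S_ com sg nom //")]
print(find_disagreements(annotator_1, annotator_2))
```

Restricting the third pass to the positions returned here is a design choice of the sketch, not a documented part of the project workflow.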

The file failid.zip contains manually disambiguated files. All the files are in the folder myh01.

Example of disambiguated sentences:

Suurtes suur+tes //_A_ pos pl in // ja ja+0 //_J_ crd // hallides hall+des //_A_ pos pl in // teeäärsetes tee_äärne+tes //_A_ pos pl in // taludes talu+des //_S_ com pl in // olid ole+id //_V_ aux indic impf ps3 pl ps af // elanud ela+nud //_V_ main partic past ps // kulakud kulak+d //_S_ com pl nom // ja ja+0 //_J_ crd // raudsängijalgadesse raud_sängi_jalg+desse //_S_ com pl ill // kulda kuld+0 //_S_ com sg part // peitnud peit+nud //_V_ main partic past ps // . . //_Z_ Fst // Ühe üks+0 //_P_ indef sg gen // talu talu+0 //_S_ com sg gen // perenaine pere_naine+0 //_S_ com sg nom // oli ole+i //_V_ aux indic impf ps3 sg ps af // aga aga+0 //_J_ crd // ennast ise+t //_P_ refl sg part // koguni koguni+0 //_D_ // sängijala sängi_jalg+0 //_S_ com sg gen // külge külge+0 //_K_ post // ära ära+0 //_D_ // poonud poo+nud //_V_ main partic past ps // . . //_Z_ Fst // Mõned mõni+d //_P_ indef pl nom // lagunenud lagunenud+0 //_A_ pos // sängid säng+d //_S_ com pl nom // vedelesid vedele+sid //_V_ main indic impf ps3 pl ps af // veel veel+0 //_D_ // praegugi praegu+gi //_D_ // nõgestes nõges+tes //_S_ com pl in // . . //_Z_ Fst //
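
The annotation format shown above can be read programmatically. The following is a minimal sketch, assuming (from the example only, not from a format specification) that each token consists of a word form, a lemma+ending analysis, a one-letter word class between //_ and _, and optional space-separated features closed by //. The names Token, TOKEN_RE and parse_tagged are hypothetical.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str            # word form as it appears in the text
    lemma: str           # part of the analysis before "+"
    ending: str          # part after "+" ("0" in the example marks a zero ending)
    pos: str             # one-letter word class, e.g. "S", "V", "A"
    features: List[str]  # remaining tags, e.g. ["com", "pl", "in"]


# Assumed token layout: form  lemma+ending  //_X_ features //
TOKEN_RE = re.compile(r"(\S+)\s+(\S+)\s+//_([A-Z])_\s*(.*?)\s*//")


def parse_tagged(text: str) -> List[Token]:
    """Split a run of tagged text into Token records."""
    tokens = []
    for form, analysis, pos, feats in TOKEN_RE.findall(text):
        # Punctuation such as "." has no "+", so its ending stays empty.
        lemma, _, ending = analysis.partition("+")
        tokens.append(Token(form, lemma, ending, pos, feats.split()))
    return tokens


sample = "olid ole+id //_V_ aux indic impf ps3 pl ps af // koguni koguni+0 //_D_ // . . //_Z_ Fst //"
for token in parse_tagged(sample):
    print(token)
```

Records of this kind could feed, for example, the frequency counts or syntactic analysis mentioned in the description; the actual files in failid.zip may use a different line layout than the running text shown here.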

An online query interface is available for the morphologically disambiguated corpus (corpus query - Online GUI of Morphologically Disambiguated Corpora).

Amount and structure

In the course of the current project, 100 000 running words of fiction by Estonian authors were disambiguated. All these texts (53 texts, each containing approximately 2000 words) come from the 1980s subcorpus of the Corpus of Written Estonian (https://www.cl.ut.ee/korpused/baaskorpus/1980/). The code numbers in the file names have been preserved; only the prefix stkt or tkt has been replaced by ilu (= fiction), so every file name begins with this 3-letter code.
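
The naming convention can be sketched as a small mapping from a source-corpus file name to its name in this corpus. This is an illustration only: the function name to_ilu_name and the sample file names are hypothetical, and the exact form of the code numbers is not given in this report.

```python
import re

# Matches the old prefix only at the start of the name; the code number is untouched.
_OLD_PREFIX = re.compile(r"^(?:stkt|tkt)")


def to_ilu_name(original_name: str) -> str:
    """Replace a leading stkt or tkt prefix with ilu, keeping the rest of the name."""
    return _OLD_PREFIX.sub("ilu", original_name, count=1)


# Hypothetical file names, used only to show the mapping.
print(to_ilu_name("stkt0012"))
print(to_ilu_name("tkt0034"))
```

Because the code number is preserved, a disambiguated text can be traced back to its source in the 1980s subcorpus by reversing the prefix.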

Development period

M0 (November 2002) to M11 (October 2003)

Developed by

The following researchers participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muischnek, Heili Orav, Andriela Rääbis, Kadri Vider.

Link to the corpus:

https://www.cl.ut.ee/korpused/morfkorpus/

Papers and presentations:

1) Kaalep, H-J., Muischnek, K. Inconsistent Selectional Criteria in Semi-automatic Multi-word Unit Extraction. COMPLEX 2003, 7th Conference on Computational Lexicography and Corpus Research, ed. by F. Kiefer, J. Pajzs. Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, 2003, pp. 27-36.

2) Kaalep, H-J., Muischnek, K. Frequency Dictionary of Written Estonian of the 1990ies. In: The First Baltic Conference. Human Language Technologies. The Baltic Perspective. Commission of the Official Language at the Chancellery of the President of Latvia, Riga, 2004, pp. 57-60.

Last modified: October 11 2018 19:27:06.