
eVikings II (Establishment of the Virtual Centre of Excellence for IST RTD in Estonia)

FP5 IST accompanying measures project IST-2001-37592

Work Package 3: Supporting RTD in language technologies

Project: Morphologically tagged and disambiguated text corpus

University of Tartu

Report

Description:

Correctly morphologically disambiguated corpora are needed for:

Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997), during which G. Orwell's novel "1984" was disambiguated. The main part of this corpus, 400 000 words, was disambiguated during 2002-2003. This work was supported by the eVikings II project (100 000 words) and the national program "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture) (300 000 words).

The morphological disambiguation proceeded as follows:

  1. first, the texts were processed with the morphological analyser ESTMORF (developed by Filosoft Ltd - https://www.filosoft.ee);

  2. every text was then disambiguated manually by two people, and a third person compared the results and made the necessary corrections.
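
The comparison in step 2 can be illustrated with a minimal sketch. The report does not say which tools the third person used, so the function name, the data layout (a list of word form and chosen analysis pairs per annotator) and the disagreement in the sample data are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: one (word form, chosen analysis) pair per token.
Tagged = Tuple[str, str]


def find_disagreements(first: List[Tagged], second: List[Tagged]):
    """Return (position, word, analysis_1, analysis_2) for every token
    on which the two annotators chose different analyses."""
    if len(first) != len(second):
        raise ValueError("annotations cover a different number of tokens")
    diffs = []
    for i, ((w1, a1), (w2, a2)) in enumerate(zip(first, second)):
        if w1 != w2:
            raise ValueError(f"token mismatch at position {i}: {w1!r} vs {w2!r}")
        if a1 != a2:
            diffs.append((i, w1, a1, a2))
    return diffs


# Invented sample: the two annotators disagree on the case of "talu".
annotator_a = [("Ühe", "üks+0 //_P_ indef sg gen //"),
               ("talu", "talu+0 //_S_ com sg gen //")]
annotator_b = [("Ühe", "üks+0 //_P_ indef sg gen //"),
               ("talu", "talu+0 //_S_ com sg nom //")]

for position, word, a, b in find_disagreements(annotator_a, annotator_b):
    print(position, word, a, b, sep="\t")
```

Only the tokens listed by such a comparison would need a decision from the third person; all other tokens already carry two matching analyses.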

The file failid.zip contains the manually disambiguated files. All the files are in the folder myh01.

Example of disambiguated sentences:

Suurtes
    suur+tes //_A_ pos pl in //
ja
    ja+0 //_J_ crd //
hallides
    hall+des //_A_ pos pl in //
teeäärsetes
    tee_äärne+tes //_A_ pos pl in //
taludes
    talu+des //_S_ com pl in //
olid
    ole+id //_V_ aux indic impf ps3 pl ps af //
elanud
    ela+nud //_V_ main partic past ps //
kulakud
    kulak+d //_S_ com pl nom //
ja
    ja+0 //_J_ crd //
raudsängijalgadesse
    raud_sängi_jalg+desse //_S_ com pl ill //
kulda
    kuld+0 //_S_ com sg part //
peitnud
    peit+nud //_V_ main partic past ps //
.
    . //_Z_ Fst //
Ühe
    üks+0 //_P_ indef sg gen //
talu
    talu+0 //_S_ com sg gen //
perenaine
    pere_naine+0 //_S_ com sg nom //
oli
    ole+i //_V_ aux indic impf ps3 sg ps af //
aga
    aga+0 //_J_ crd //
ennast
    ise+t //_P_ refl sg part //
koguni
    koguni+0 //_D_ //
sängijala
    sängi_jalg+0 //_S_ com sg gen //
külge
    külge+0 //_K_ post //
ära
    ära+0 //_D_ //
poonud
    poo+nud //_V_ main partic past ps //
.
    . //_Z_ Fst //
Mõned
    mõni+d //_P_ indef pl nom //
lagunenud
    lagunenud+0 //_A_ pos //
sängid
    säng+d //_S_ com pl nom //
vedelesid
    vedele+sid //_V_ main indic impf ps3 pl ps af //
veel
    veel+0 //_D_ //
praegugi
    praegu+gi //_D_ //
nõgestes
    nõges+tes //_S_ com pl in //
.
    . //_Z_ Fst //
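
The layout of the example above (a word form on an unindented line, followed by an indented line of the shape lemma+ending //_POS_ features //) could be read with a short script along the following lines. This is a sketch inferred only from the sample shown here, not an official reader for the corpus files; the names Analysis, parse_analysis and parse_corpus are hypothetical, and the real files may contain constructs that the sample does not show.

```python
import re
from typing import List, NamedTuple, Tuple


class Analysis(NamedTuple):
    lemma: str           # e.g. "tee_äärne" (compound parts joined by "_")
    ending: str          # e.g. "tes"; "0" marks a zero ending
    pos: str             # one-letter word class code, e.g. "A", "S", "V"
    features: List[str]  # remaining tags, e.g. ["pos", "pl", "in"]


# Assumed shape of an analysis line: form //_X_ optional features //
ANALYSIS_RE = re.compile(
    r"^(?P<form>.+?)\s+//_(?P<pos>\w)_\s*(?P<feats>.*?)\s*//\s*$"
)


def parse_analysis(line: str) -> Analysis:
    match = ANALYSIS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised analysis line: {line!r}")
    form = match["form"]
    lemma, sep, ending = form.rpartition("+")
    if not sep:  # punctuation such as "." has no "+ending" part
        lemma, ending = form, ""
    return Analysis(lemma, ending, match["pos"], match["feats"].split())


def parse_corpus(text: str) -> List[Tuple[str, List[Analysis]]]:
    """Return a list of (word form, analyses) pairs in text order."""
    tokens: List[Tuple[str, List[Analysis]]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():  # indented line: analysis of the last token
            if not tokens:
                raise ValueError("analysis line before any word form")
            tokens[-1][1].append(parse_analysis(line))
        else:                  # unindented line: a new word form
            tokens.append((line.strip(), []))
    return tokens


SAMPLE = (
    "Suurtes\n    suur+tes //_A_ pos pl in //\n"
    "koguni\n    koguni+0 //_D_ //\n"
    ".\n    . //_Z_ Fst //\n"
)

for word, analyses in parse_corpus(SAMPLE):
    print(word, analyses)
```

In a fully disambiguated file each word form is expected to carry exactly one analysis, so a list with more or fewer than one entry would point to a formatting problem or a token left ambiguous.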

The morphologically disambiguated corpus can also be queried online (corpus query - Online GUI of Morphologically Disambiguated Corpora).

Amount and structure

In the course of the current project, 100 000 running words of fiction by Estonian authors have been disambiguated. All these texts (53 texts, each containing approximately 2000 words) come from the 1980s subcorpus of the Corpus of Written Estonian (https://www.cl.ut.ee/korpused/baaskorpus/1980/). The code numbers in the file names have been preserved; only the prefix stkt or tkt has been replaced by ilu, so every file name begins with the 3-letter code ilu (= fiction).
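
The renaming rule can be expressed as a small sketch, which may help when matching a disambiguated file to its source text in the 1980s subcorpus. The function name and the sample file names below are hypothetical; the report gives only the prefixes, not the exact form of the code numbers.

```python
import re

# Only a leading "stkt" or "tkt" is replaced; the code number is kept.
PREFIX_RE = re.compile(r"^(?:stkt|tkt)")


def to_disambiguated_name(source_name: str) -> str:
    """Map a source file name to its name in the disambiguated corpus."""
    return PREFIX_RE.sub("ilu", source_name, count=1)


# Invented file names, for illustration only.
for name in ["stkt0012", "tkt0034"]:
    print(name, "->", to_disambiguated_name(name))
```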

Development period

M0 (November 2002) - M11 (October 2003)

Developed by

The following researchers have participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muischnek, Heili Orav, Andriela Rääbis, Kadri Vider.

Link to the corpus:

https://www.cl.ut.ee/korpused/morfkorpus/

Papers and presentations:

1) Kaalep, H-J., Muischnek, K. Inconsistent Selectional Criteria in Semi-automatic Multi-word Unit Extraction. COMPLEX 2003, 7th Conference on Computational Lexicography and Corpus Research, ed. by F. Kiefer, J. Pajzs, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest 2003, pp. 27-36.

2) Kaalep, H-J., Muischnek, K. Frequency Dictionary of Written Estonian of the 1990ies. In: The First Baltic Conference. Human Language Technologies. The Baltic Perspective. Commission of the Official Language at the Chancellery of the President of Latvia, Riga, 2004, pp. 57-60.


Last modified: October 11 2018 19:27:06.