University of Tartu
Correctly morphologically disambiguated corpora are needed for:
input to a syntactic analyser;
input to a semantic analyser;
developing an automatic morphological disambiguator;
compiling frequency dictionaries;
linguistic research.
Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997), during which G. Orwell's novel "1984" was disambiguated. The main part of this corpus, 400 000 words, was disambiguated during 2002-2003. This work was supported by the eVikingsII project (100 000 words) and the national program "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture) (300 000 words).
The method of morphological disambiguation was as follows:
first, the texts were processed with the morphological analyser ESTMORF (developed by Filosoft Ltd - https://www.filosoft.ee);
every text was then disambiguated manually by two people, and a third person compared the results and made the necessary corrections.
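The comparison step can be illustrated with a minimal sketch. It assumes that each annotator's output is available as a list of (word form, chosen analysis) pairs. This representation, the function name find_disagreements and the sample readings are assumptions made for illustration only, not the tooling actually used in the project.

```python
from typing import List, Tuple

Annotation = Tuple[str, str]             # (word form, chosen analysis)
Disagreement = Tuple[int, str, str, str]  # (index, word form, analysis A, analysis B)


def find_disagreements(first: List[Annotation],
                       second: List[Annotation]) -> List[Disagreement]:
    """Return the tokens for which two annotators chose different analyses."""
    if len(first) != len(second):
        raise ValueError("annotations cover a different number of tokens")
    disagreements = []
    for index, ((form_a, ana_a), (form_b, ana_b)) in enumerate(zip(first, second)):
        if form_a != form_b:
            raise ValueError(f"word forms differ at token {index}")
        if ana_a != ana_b:
            disagreements.append((index, form_a, ana_a, ana_b))
    return disagreements


# Hypothetical output of two annotators for the same three tokens.
annotator_a = [
    ("Ühe", "üks+0 //_P_ indef sg gen //"),
    ("talu", "talu+0 //_S_ com sg gen //"),
    ("perenaine", "pere_naine+0 //_S_ com sg nom //"),
]
annotator_b = [
    ("Ühe", "üks+0 //_P_ indef sg gen //"),
    ("talu", "talu+0 //_S_ com sg nom //"),
    ("perenaine", "pere_naine+0 //_S_ com sg nom //"),
]

# Only these tokens would need a decision from the third person.
for index, form, ana_a, ana_b in find_disagreements(annotator_a, annotator_b):
    print(f"token {index} ({form}): {ana_a} | {ana_b}")
```

Listing only the differing tokens keeps the third person's work limited to the cases where the two manual disambiguations actually conflict.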
The file failid.zip contains the manually disambiguated files. All the files are in the folder myh01.
Example of disambiguated sentences:
Suurtes
suur+tes
//_A_ pos pl in //
ja
ja+0
//_J_ crd //
hallides
hall+des
//_A_ pos pl in //
teeäärsetes
tee_äärne+tes
//_A_ pos pl in //
taludes
talu+des
//_S_ com pl in //
olid
ole+id
//_V_ aux indic impf ps3 pl ps af //
elanud
ela+nud
//_V_ main partic past ps //
kulakud
kulak+d
//_S_ com pl nom //
ja
ja+0
//_J_ crd //
raudsängijalgadesse
raud_sängi_jalg+desse
//_S_ com pl ill //
kulda
kuld+0
//_S_ com sg part //
peitnud
peit+nud
//_V_ main partic past ps //
.
.
//_Z_ Fst //
Ühe
üks+0
//_P_ indef sg gen //
talu
talu+0
//_S_ com sg gen //
perenaine
pere_naine+0
//_S_ com sg nom //
oli
ole+i
//_V_ aux indic impf ps3 sg ps af //
aga
aga+0
//_J_ crd //
ennast
ise+t
//_P_ refl sg part //
koguni
koguni+0
//_D_ //
sängijala
sängi_jalg+0
//_S_ com sg gen //
külge
külge+0
//_K_ post //
ära
ära+0
//_D_ //
poonud
poo+nud
//_V_ main partic past ps //
.
.
//_Z_ Fst //
Mõned
mõni+d
//_P_ indef pl nom //
lagunenud
lagunenud+0
//_A_ pos //
sängid
säng+d
//_S_ com pl nom //
vedelesid
vedele+sid
//_V_ main indic impf ps3 pl ps af //
veel
veel+0
//_D_ //
praegugi
praegu+gi
//_D_ //
nõgestes
nõges+tes
//_S_ com pl in //
.
.
//_Z_ Fst //
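As a rough illustration of how such annotation could be read by a program, the sketch below parses the example into token records. It assumes the three-line layout shown above (word form, lemma+ending, tags between // markers). The files in failid.zip may be laid out differently, and the names Token and parse_tokens are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str            # surface word form, e.g. "Suurtes"
    lemma: str           # part before the last "+", e.g. "suur"
    ending: str          # part after the last "+"; "0" marks no ending
    pos: str             # tag between underscores, e.g. "A"
    features: List[str]  # remaining tags, e.g. ["pos", "pl", "in"]


def parse_tokens(text: str) -> List[Token]:
    """Parse text in the assumed three-lines-per-token layout."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per token")
    tokens = []
    for i in range(0, len(lines), 3):
        form, analysis, tags = lines[i:i + 3]
        if not (tags.startswith("//") and tags.endswith("//")):
            raise ValueError(f"malformed tag line: {tags!r}")
        if "+" in analysis:
            lemma, _, ending = analysis.rpartition("+")
        else:
            # Punctuation such as "." has no "+" separator.
            lemma, ending = analysis, ""
        parts = tags.strip("/").split()
        if not parts:
            raise ValueError(f"missing part-of-speech tag: {tags!r}")
        tokens.append(Token(form, lemma, ending, parts[0].strip("_"), parts[1:]))
    return tokens


sample = """Suurtes
suur+tes
//_A_ pos pl in //
ja
ja+0
//_J_ crd //
.
.
//_Z_ Fst //
"""

tokens = parse_tokens(sample)
for token in tokens:
    print(token.form, token.lemma, token.pos, token.features)
```

Records of this kind can then be counted by lemma or by tag, for example as a first step towards a frequency dictionary.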
An online query interface is available for the morphologically disambiguated corpus (corpus query - Online GUI of Morphologically Disambiguated Corpora).
In the course of the current project, 100 000 running words of fiction by Estonian authors have been disambiguated. All these texts (53 texts, each containing approximately 2000 words) come from the 1980s subcorpus of the Corpus of Written Estonian (https://www.cl.ut.ee/korpused/baaskorpus/1980/). The code numbers in the file names have been preserved; only the prefix stkt or tkt has been replaced by ilu. File names therefore begin with the 3-letter code ilu (= fiction).
M0 (November 2002) – M11 (October 2003)
The following researchers have participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muischnek, Heili Orav, Andriela Rääbis, Kadri Vider.
Link to the corpus:
https://www.cl.ut.ee/korpused/morfkorpus/
1) Kaalep, H-J., Muischnek, K. Inconsistent Selectional Criteria in Semi-automatic Multi-word Unit Extraction. COMPLEX 2003, 7th Conference on Computational Lexicography and Corpus Research, ed. by F. Kiefer, J. Pajzs, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest 2003, pp. 27-36.
2) Kaalep, H-J., Muischnek, K. Frequency Dictionary of Written Estonian of the 1990ies. In: The First Baltic Conference. Human Language Technologies. The Baltic Perspective. Commission of the Official Language at the Chancellery of the President of Latvia, Riga, 2004, pp. 57-60.