
Estonian Wordnet

Introduction

A thesaurus is a conceptual dictionary: words (and phrases) are organised by conceptual (semantic) links. A computer thesaurus is a database containing information about meanings and semantic relations.

The thesaurus of Estonian at the University of Tartu (TEKsaurus)

The computational linguistics group at the University of Tartu has been compiling a thesaurus of general Estonian, the Estonian WordNet (EstWN), since 1998. The work has been led by Prof. Haldur Õim. The main editors have been Kadri Vider, Heili Orav, Leho Paldre and Neeme Kahusk.

The project has been supported by the Estonian Science Foundation and the Estonian Informatics Centre within the programme Eesti keeletehnoloogia (Estonian Language Technology), and also by the state programme "Estonian Language and National Culture".

EstWN is based on the wordnet theory and we have closely followed the principles adopted in the Princeton WordNet and EuroWordNet projects.

The words included in EstWN come from existing traditional dictionaries, mainly the Explanatory Dictionary of Estonian ("Eesti Kirjakeele Seletussõnaraamat"), and from Estonian corpora, which provide usage information. One might therefore suppose that the semantic information in the database reflects lexical knowledge.

Word sense disambiguation experiments on real texts have shown that EstWN includes the senses of the main vocabulary of Estonian.

The EstWN database is available as an sdb file (see also specification) or as a zipped txt file (see also specification), distributed by ELDA.

Wordnet-type thesaurus

The atom of a wordnet-type thesaurus is a synonym set (also called a synset): a set containing all the synonymous words or multi-word units that express the same concept. All words in a synset belong to the same part of speech. In the simplest case, such a set contains only one word, i.e. that word has no synonyms (the corresponding concept can be expressed by only one word).

The synsets are numbered, and each of them corresponds to one record in the database. Words and phrases are numbered according to sense. The sense number indicates that a word (phrase) can have more than one sense (it can appear in more than one synset), but words appearing in only one synset have sense numbers as well.
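The synset description above can be sketched as a small data structure. This is only an illustration, not the actual EstWN record format: the class names `Literal` and `Synset`, the helper `senses_of`, and all ids and sense numbers below are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Literal:
    """A word or phrase together with its sense number."""
    form: str
    sense: int


@dataclass
class Synset:
    """One database record: a numbered set of synonymous literals."""
    synset_id: int
    pos: str  # "n", "v", "a" or "pn"
    literals: list[Literal] = field(default_factory=list)


def senses_of(form: str, synsets: list[Synset]) -> list[tuple[int, int]]:
    """Return (synset id, sense number) for every synset containing `form`."""
    return [
        (s.synset_id, lit.sense)
        for s in synsets
        for lit in s.literals
        if lit.form == form
    ]


# Illustrative records only; the ids and sense numbers are made up.
synsets = [
    Synset(1, "n", [Literal("koht", 1), Literal("paik", 1)]),
    Synset(2, "n", [Literal("koht", 2)]),
    Synset(3, "v", [Literal("lubama", 1)]),  # a one-word synset
]

print(senses_of("koht", synsets))    # [(1, 1), (2, 2)]
print(senses_of("lubama", synsets))  # [(3, 1)]
```

A word with several senses appears in several synsets, while a word with a single sense still carries a sense number.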

Diagram

EstWN currently contains about 10,000 synsets. Most of them describe noun senses (66%) and verb senses (27%); a small number of adjective and proper name senses have also been included. Each synset has 2 links on average, with the focus on hyponymy and hypernymy relations.

The synsets are connected by links that correspond to semantic or lexical relations between concepts. The most important relations are hyponymy and hypernymy, but meronymy, holonymy, antonymy, cause, role, derivational and gradation relations are also marked. Altogether, approximately 60 different relations appear in EstWN.
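One way to picture these links is as a labelled graph over synsets. The sketch below is a minimal illustration under stated assumptions: each synset is represented by a single word for brevity, the example links are taken from the relation examples on this page, and the function name `hypernym_chain` is hypothetical.

```python
from collections import defaultdict

# (source synset, relation name, target synset)
links = [
    ("volitus", "has_hyperonym", "luba"),
    ("luba", "has_hyponym", "volitus"),
    ("lubama", "has_hyperonym", "soostuma"),
    ("lubama", "antonym", "keelama"),
]

# graph[source][relation] -> list of target synsets
graph = defaultdict(lambda: defaultdict(list))
for source, relation, target in links:
    graph[source][relation].append(target)


def hypernym_chain(start):
    """Follow the first has_hyperonym link upward until none is left."""
    chain = []
    seen = {start}  # guards against accidental cycles in the data
    current = start
    while True:
        parents = graph.get(current, {}).get("has_hyperonym", [])
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


print(hypernym_chain("volitus"))    # ['luba']
print(hypernym_chain("lubama"))     # ['soostuma']
print(graph["lubama"]["antonym"])   # ['keelama']
```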

Distribution

Acronyms

v - verb (e.g. saama) or verb phrase (e.g. algust saama)
n - noun (e.g. number) or noun phrase (e.g. araabia number)
a - adjective (e.g. kena) or adjective phrase
pn - proper noun (e.g. Aleksander)

Semantic relations used in Estonian WordNet

Link name in EWN | Explanation | Example
antonym | has antonym | lubama (to allow) has antonym keelama (to forbid)
be_in_state | is in state of | värv, värvus (color) is in state of värviline (colorful)
belongs_to_class | belongs to class (used to link word instance to word meaning) | Aleksander belongs to class mees (man)
causes | causes | lubama (to permit) causes luba (permission)
fuzzynym | is somehow connected to | kord, puhk (time) is somehow connected to moment (moment)
has_holo_location | is part of a place | ülikool (university) is part of a place ülikoolilinn (campus)
has_holo_madeof | is material of | puit (wood) is material of puu (tree)
has_holo_member | is member of | liige (member) is member of kollektiiv (staff)
has_holo_part | is part of | koht (place) is part of ruum (room)
has_holo_portion | is a portion of | mõte (thought) is a portion of mõttetegevus (thinking)
has_holonym | is part of | ühik (unit) is part of hulk (amount)
has_hyperonym | is a way of [v]; is a kind of [n] | lubama (to allow) is a way of soostuma (to agree); volitus (mandate) is a kind of luba (permission)
has_hyponym | has a way [v]; has a special kind [n] | soostuma (to agree) has a way lubama (to allow); luba (permission) has a special kind volitus (mandate)
has_instance | has instance | mees (man) has instance Aleksander
has_mero_location | a part of place is | ülikoolilinn (campus) a part of place is ülikool (university)
has_mero_madeof | has part of (material) | puu (tree) has part of (material) puit (wood)
has_mero_member | a member is | kollektiiv (staff) a member is liige (member)
has_mero_part | has part | ruum (room) has part koht (place)
has_mero_portion | a portion is | mõttetegevus (thinking) a portion is mõte (thought)
has_meronym | has part | hulk (amount) has part ühik (unit)
has_subevent | has subevent | otsustama (to judge) has subevent arvama (to believe, think)
has_xpos_hyperonym | is a way of, is a kind of (used to link different parts of speech) | taotlema (to apply) is a kind of suhtlus (communication)
has_xpos_hyponym | one way is (used to link different parts of speech) | suhtlus (communication) one way is mõjutama (to influence)
involved | involved | püsima (to stay) involved seisund (condition); teavet andma (to inform) involved informatsioon (information)
involved_agent | involved agent | kõnelema (to speak) involved agent kõneleja (speaker)
involved_instrument | involved instrument | käskima (to order) involved instrument mõjujõud (influence)
involved_location | involved location | asuma (to be situated) involved location koht, paik (location)
involved_patient | involved patient | rääkima (to speak) involved patient kuulaja (listener)
involved_target_direction | involved target direction | minema (to go) involved target direction koht (location)
is_caused_by | is caused by | luba (permission) is caused by lubama (to permit)
is_subevent_of | is subevent of | arvama (to believe, think) is subevent of otsustama (to judge)
near_antonym | has near antonym | saabuma (to come) has near antonym minema (to go)
near_synonym | has near synonym | katma (to cover) has near synonym varjama (to hide)
role | plays a role | teadmine (knowledge) plays a role teadma (to know)
role_agent | plays a role as agent | kõneleja (speaker) plays a role as agent kõnelema (to speak)
role_instrument | plays a role as instrument | meelitus (temptation) plays a role as instrument ahvatlema (to allure, tempt)
role_location | plays a role as location | koht, paik (place) plays a role as location asuma (to be, occupy a certain position)
role_patient | plays a role as patient | arv (number) plays a role as patient korrutama (to multiply)
role_target_direction | plays a role as target direction | koht (place) plays a role as target direction minema (to go)
state_of | state of | värviline (colorful) state of värv, värvus (color)
xpos_fuzzynym | is somehow connected to | õis (blossom [n]) is somehow connected to õitsema (to blossom [v])
xpos_near_antonym | is almost antonym | küsimus (question) is almost antonym vastama (to answer [v])
xpos_near_synonym | is almost synonym | liikuma (to move) is almost synonym kulgemine (locomotion)
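Many relations in the table come in mirrored pairs: for instance, the has_hyperonym and has_hyponym examples are the same word pairs read in opposite directions. A minimal sketch of a consistency check built on that observation follows. The pairing in `INVERSE_PAIRS` is inferred from the examples and the relation names and is an assumption, not a documented part of EstWN; the function name `missing_inverses` and the sample links are hypothetical.

```python
# Assumed inverse pairs, inferred from the mirrored examples in the table.
INVERSE_PAIRS = [
    ("has_hyperonym", "has_hyponym"),
    ("has_holonym", "has_meronym"),
    ("has_holo_part", "has_mero_part"),
    ("has_holo_member", "has_mero_member"),
    ("has_holo_madeof", "has_mero_madeof"),
    ("has_holo_location", "has_mero_location"),
    ("has_holo_portion", "has_mero_portion"),
    ("causes", "is_caused_by"),
    ("has_subevent", "is_subevent_of"),
    ("be_in_state", "state_of"),
    ("belongs_to_class", "has_instance"),
    ("involved_agent", "role_agent"),
]

# Build a lookup that works in both directions.
INVERSE = {}
for forward, backward in INVERSE_PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward


def missing_inverses(links):
    """Return the reverse links that the assumed pairing expects but that are absent."""
    present = set(links)
    missing = []
    for source, relation, target in links:
        if relation in INVERSE:
            expected = (target, INVERSE[relation], source)
            if expected not in present:
                missing.append(expected)
    return missing


# Sample links using word pairs from the table; synsets are shown as single words.
sample = [
    ("volitus", "has_hyperonym", "luba"),
    ("luba", "has_hyponym", "volitus"),
    ("lubama", "causes", "luba"),
]

print(missing_inverses(sample))  # [('luba', 'is_caused_by', 'lubama')]
```

Relations such as antonym or near_synonym are left out of the lookup because their examples give no separate reverse name.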

Lexical sources

Publications on EstWN

Papers

  1. Neeme Kahusk, Kadri Vider. TEKsaurus - The Estonian WordNet Online. The Second Baltic Conference on Human Language Technologies, April 4-5, 2005. Proceedings, pp. 273-278.
  2. Vider, K., Kerner, K. Word Sense Disambiguation Corpus of Estonian. The Second Baltic Conference on Human Language Technologies, April 4-5, 2005. Proceedings, pp. 143-148.
  3. Vider, K., Orav, H. Estonian wordnet and Lexicography. Symposium on Lexicography XI. Proceedings of the Eleventh International Symposium on Lexicography, May 2-4, 2002, at the University of Copenhagen. Ed. by H. Gottlieb, J. E. Mogensen and A. Zettersten. In series: Lexicographica, Series Maior 115. Max Niemeyer Verlag, Tübingen 2005, pp. 549-555.
  4. Kadri Muischnek, Heili Orav, Heiki-Jaan Kaalep, Haldur Õim. Eesti keele tehnoloogilised ressursid ja vahendid. Arvutikorpused, arvutisõnastikud, keeletehnoloogiline tarkvara. Eesti Keele Sihtasutus, Tallinn 2003.
  5. Vider, K., Orav, H. Idee ja rakenduse vahe tesauruse näitel. Eesti Keele Instituudi toimetised 14. Toimiv keel I. Töid rakenduslingvistika alalt. Eesti Keele Sihtasutus, Tallinn 2003, pp. 313-322.
  6. Vider, K., Orav, H. Concerning the difference between a conception and its application in the case of the Estonian wordnet. Proceedings of the second international wordnet conference. Eds. P. Sojka, K. Pala, P. Smrz, Ch. Fellbaum, P. Vossen. Masaryk University, Brno, 2003, pp. 285-290.
  7. Kahusk, Neeme. A Lexicographer's Tool for Word Sense Tagging According to WordNet. Proceedings of Workshop on Wordnet Structures and Standardisation, and how these Affect Wordnet Applications and Evaluation; Third International Conference on Language Resources and Evaluation (LREC 2002). Ed. D. N. Christodoulakis, C. Kunze, L. Lemnitzer. ELRA, Las Palmas de Gran Canaria 2002, pp. 1-7.
  8. Orav, H. Adjectives in wordnet-type thesaurus: Estonian experience. In Proceedings of the 1st International Global WordNet Conference, Central Institute of Indian Languages, Mysore, India, 2002, pp. 22-25.
  9. Vider, Kadri. Notes about labelling semantic relations in Estonian WordNet. Proceedings of Workshop on Wordnet Structures and Standardisation, and how these Affect Wordnet Applications and Evaluation; Third International Conference on Language Resources and Evaluation (LREC 2002). Ed. D. N. Christodoulakis, C. Kunze, L. Lemnitzer. ELRA, Las Palmas de Gran Canaria 2002, pp. 56-59.
  10. Orav, H., Vider, K. Kas tesaurus ja tekstid lähevad kasutuses kokku? In "Tähendusepüüdja Catcher of the Meaning", TÜ üldkeeleteaduse õppetooli toimetised 3, Tartu 2002, pp. 297-303.
  11. Vider, Kadri. Eesti keele tesaurus - teooria ja tegelikkus. Leksikograafiaseminar "Sõna tänapäeva maailmas" Leksikografinen seminaari "Sanat nykymaailmassa". Ettekannete kogumik. Ed. M. Langemets. Eesti Keele Instituudi toimetised 9. Tallinn 2001, pp. 134-156.
  12. Orav, H. Adjektiivid kui semantiline probleem: wordnet-tüüpi tesauruste koostamise kogemused. In Arvutuslingvistikalt inimesele, Tartu 2000, pp. 153-166.
  13. Kadri Vider, Neeme Kahusk, Heili Orav, Haldur Õim, Leho Paldre. Eesti keele tesaurus. In Arvutuslingvistikalt inimesele, Tartu 2000, pp. 127-152.
  14. Orav, H., Vider, K. Estonian WordNet. In Congressus Nonus Internationalis Fenno-Ugristarum. 7.-13.8.2000 Tartu. Pars V. Dissertationes sectionum: Linguistica II, pp. 490-497.
  15. Vider, K., Orav, H. Sõna tasandilt mõiste ruumi. Keel ja Kirjandus 1, 1998, pp. 57-64.
  16. Kadri Vider. Some Problems in Estonian Wordnet. Papers of the Second Swiss-Estonian Student Workshop on Computational and Theoretical Linguistics, Zurich 1997. Electronic publication.

Reports

  1. Vossen, P., C. Kunze, A. Wagner, D. Dutoit, K. Pala, P. Ševeček, K. Vider, L. Paldre, H. Orav, H. Õim. 1998. Revised Set of Common Base Concepts. EuroWordNet-2 (LE-8328), Deliverable 2D001, University of Amsterdam.
  2. Õim, H., K. Vider, L. Paldre, H. Orav, K. Pala. 1998. Specification of Czech and Estonian WNs. EuroWordNet (LE-8328), Deliverable 2D003.
  3. Pala, Karel, Pavel Ševeček, Haldur Õim, Kadri Vider, Leho Paldre, Heili Orav. 1998. Tools & resources Estonian & Czech WNs. EuroWordNet (LE-8328), Deliverable 2D006.
  4. Kunze, C., A. Wagner, D. Dutoit, L. Catherin, K. Pala, P. Ševeček, K. Vider, L. Paldre, H. Orav, H. Õim. 1998. First WNs for BCs in French, German, Czech and Estonian. EuroWordNet (LE-8328), Deliverable 2D007.
  5. Laurent Catherin, Piek Vossen, Claudia Kunze, Andreas Wagner, Karel Pala, Kadri Vider. 1999. Compared and restructured wordnets for BCs in French, German, Czech & Estonian. EuroWordNet (LE-8328), Deliverable 2D008.
  6. Piek Vossen, Laura Bloksma, Wim Peters, Claudia Kunze, Andreas Wagner, Karel Pala, Kadri Vider, Francesca Bertagna. 1999. Extending the Inter-Lingual-Index with new concepts. EuroWordNet (LE-8328), Deliverable 2D010.
  7. K. Vider, L. Paldre, H. Orav, H. Õim. 1999. The Estonian Wordnet. EuroWordNet (LE-8328), Deliverable 2D014.

Theses and Students' papers

  1. Vider, Kadri. Sagedasemad eesti verbid semantilises andmebaasis. University of Tartu, Master's thesis, 1999. (Manuscript, held at the Chair of General Linguistics.)
  2. Orav, H. Eesti keele direktiivverbide semantilise välja struktuur tesaurusena. Master's thesis, Tartu 1998.
  3. Kerner, Kadri. Sõnatähendused tekstides ja tesauruses ühestajate erimeelsuste põhjal. University of Tartu, Bachelor's thesis, 2004. (Manuscript, held at the Chair of General Linguistics.)
  4. Talve, Birge. Tekstide kaudu tuvastatud eesti keele tesaurusest puuduvad sõnatähendused. University of Tartu, Bachelor's thesis, 2005. (Manuscript, held at the Chair of General Linguistics.)
  5. Uiboaed, Kristel. Sõnaseletuste genereerimine tesauruse info põhjal. University of Tartu, Bachelor's thesis, 2005. (Manuscript, held at the Chair of General Linguistics.)
  6. Pükke, Katrin. Liikumise ja paiknemisega seotud verbide semantika arvutirakenduste jaoks. University of Tartu, Bachelor's thesis, 2005. (Manuscript, held at the Chair of General Linguistics.)
  7. Konsap, Gaili. Ilmastikunähtuste sõnavara leksikaal-semantiline analüüs eesti keeles. University of Tartu, Bachelor's thesis, 2008.
  8. Lepla, Heigo. Filmi produtseerimisega seotud mõistete leksikaal-semantiline analüüs. University of Tartu, Bachelor's thesis, 2008.

Last modified: October 08 2008 19:36:35.