
Estonian Reference Corpus

The Estonian Reference Corpus is a large collection of Estonian texts that is currently under construction. This work has been supported by:

What does this corpus consist of?

This corpus contains only whole texts, not text samples, and it covers written language only. For a corpus of spoken Estonian, please visit the home page of the Spoken Language Group <link: http://www.cl.ut.ee/suuline/>

At the moment, the corpus consists of the following subcorpora:

The Estonian Reference Corpus contains a more balanced subcorpus called The Balanced Corpus.

How can one use this corpus?

The corpus is free for non-commercial use. One can either:

The texts can be reached from the description of each subcorpus. Some subcorpora cannot be downloaded and can be used only via the Corpus Query.

Mark-up and annotation

The corpus texts are encoded following the TEI guidelines.

The structure of the downloadable files is as follows:

The version of the corpus that one can access via our Corpus Query has one sentence per line. The TEI markup has been deleted except for the tag <gap type='description of the omitted material'>, which stands for omitted text chunks.

Also, in the texts one can access via the Corpus Query, punctuation marks are separated from the preceding words by spaces.
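
The Corpus Query format described above (one sentence per line, space-separated punctuation, and only <gap> tags left from the mark-up) can be read with very little code. The following Python sketch is only an illustration under stated assumptions: the function names and the sample text are made up for this example, and the exact attribute layout of the <gap> tag is inferred from the description above rather than taken from real corpus output.

```python
import re

# Matches a <gap ...> tag. The attribute layout is an assumption based on
# the description above; any characters up to the closing '>' are accepted.
GAP_TAG = re.compile(r"<gap\b[^>]*>")


def tokenize_query_line(line):
    """Split one line (one sentence) into tokens.

    Words and punctuation marks are already separated by spaces, so a
    whitespace split is enough once <gap> tags are kept in one piece.
    """
    tokens = []
    pos = 0
    for match in GAP_TAG.finditer(line):
        tokens.extend(line[pos:match.start()].split())
        tokens.append(match.group(0))  # keep the whole tag as a single token
        pos = match.end()
    tokens.extend(line[pos:].split())
    return tokens


def read_sentences(text):
    """Yield a token list for every non-empty line (one sentence per line)."""
    for line in text.splitlines():
        if line.strip():
            yield tokenize_query_line(line)


# Made-up sample in the described format, not real corpus data.
sample = "Tere , maailm !\nSee on <gap type='tabel'> näide .\n"
sentences = list(read_sentences(sample))
```

Keeping the <gap> tag as a single token lets later processing decide whether to skip omitted material or to count it, instead of splitting the tag on the spaces inside its attribute value.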


Last modified: October 26 2010 14:53:49.