Estonian Reference Corpus
The Estonian Reference Corpus is a large collection of Estonian texts that is still under construction. This work has been supported by:
- national programme «Estonian Language and National Culture» (Eesti keel ja rahvuskultuur)
- national programme «The Estonian Language and National Memory»
- national programme «Estonian Language Technology»
What does this corpus consist of?
This corpus contains only whole texts, not text samples, and it covers written language only. For a corpus of spoken Estonian, please visit the home page of the Spoken Language Group (http://www.cl.ut.ee/suuline/).
At the moment, the corpus consists of the following subcorpora:
- Fiction from the year 1990 onwards (5.6 million words)
- Daily «Postimees» (issues 27.11.1995 - 10.10.2000, 1760 issues containing 88 600 articles, 32.9 million words)
- Weekly «Eesti Ekspress» (issues 09.08.1996 - 29.11.2001, 7.5 million words)
- Daily «Eesti Päevaleht» (issues 18.10.1995 - 31.10.2007, 4065 issues containing 366 862 articles, 87.9 million words)
- Magazine «Maaleht» (2001 - 2004, 4.3 million words)
- Magazine «SL Õhtuleht» (1997 - 2007, 45.5 million words)
- Magazine «Horisont» (1996 - 2003, 260 000 words)
- Magazine «Luup» (1996 - 2002, 1.9 million words)
- Magazine «Kroonika» (2001 - 2003, 600 thousand words)
- Magazine «Eesti Arst» (2002 - 2004, ca 0.7 million words)
- Magazine «Arvutitehnika ja Andmetöötlus» (1999 - 2005, 625 thousand words)
- Magazine «Agraarteadus» (2001 - 2006, 298 thousand words)
- Various scientific articles (ca 1.3 million words)
- Estonian and European legal documents (ca 1.8 million and 10 million words, respectively)
- New media (ca 21 million words)
- Parliament transcripts 1995-2001 (13 million words)
- PhD dissertations (0.5 million words)
The Estonian Reference Corpus contains a more balanced subcorpus called The Balanced Corpus.
How can one use this corpus?
The corpus is free for non-commercial use.
One can either:
- use the corpus query
- download the compressed texts
- use Keeleveeb’s corpus query to retrieve concordances of lemmas, word classes and grammatical categories, or their co-occurrences
The texts can be reached from the description of each subcorpus. Some subcorpora cannot be downloaded; they can be used via the corpus query only.
Mark-up and annotation
The corpus texts are encoded following the TEI guidelines.
The structure of the downloadable files is as follows:
- Each corpus file begins with a header <teiheader>. The header documents the name of the text(s) in the file, gives the size of the file in words and in bytes, and lists the tags used.
- The text itself begins with the tags <text><body>. In every text, at least the headings <head>, paragraphs <p> and sentences <s> have been marked. The rest of the annotation may differ between subcorpora.
- The punctuation marks have been separated from the preceding words by spaces, so a sentence that appears in ordinary text as:
Ma nägin, et ta tuleb, ja ütlesin: "Tere!"
looks like this in our corpus:
Ma nägin , et ta tuleb , ja ütlesin : " Tere ! "
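The file structure described above can be illustrated with a minimal Python sketch. It is not part of the corpus tools. The sample file, its root element <tei> and the title inside the header are made up for illustration, and the sketch assumes the input is well-formed XML, which may not hold for every downloadable file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file built only for this example. The root element
# name and the header content are assumptions, not taken from the corpus.
SAMPLE = """<tei>
<teiheader><title>Sample text</title></teiheader>
<text><body>
<head>Pealkiri</head>
<p><s>Ma nägin , et ta tuleb , ja ütlesin : " Tere ! "</s></p>
</body></text>
</tei>"""


def sentences(xml_text):
    """Yield the content of each <s> element as a list of tokens."""
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        # Punctuation marks are already separated by spaces, so splitting
        # on whitespace is enough to obtain word and punctuation tokens.
        yield "".join(s.itertext()).split()


tokens = next(sentences(SAMPLE))
print(tokens)
```

Because the punctuation is pre-separated, no language-specific tokenizer is needed for a first pass over the sentences.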
The version of the corpus that can be accessed via our Corpus Query has one sentence per line. The TEI markup has been removed, except for the tag <gap type=’description of the omitted material’>, which marks omitted chunks of text.
In the texts accessed via the Corpus Query, too, the punctuation marks are separated from the preceding words by spaces.
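Lines in the one-sentence-per-line format could be post-processed roughly as in the following Python sketch. The function names, the sample line and the gap type value 'tabel' are hypothetical. Reattaching punctuation here covers only a few common marks and leaves quotation marks alone, since an opening quote cannot be told from a closing one without more context.

```python
import re

# Matches any remaining <gap ...> tag, whatever its type attribute says.
GAP = re.compile(r"<gap\b[^>]*>")
# Whitespace directly before a few common punctuation marks.
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")


def clean_line(line):
    """Drop <gap ...> tags and collapse runs of whitespace."""
    return " ".join(GAP.sub(" ", line).split())


def reattach(line):
    """Glue commas, full stops and similar marks back onto the preceding word."""
    return SPACE_BEFORE_PUNCT.sub(r"\1", line)


# Hypothetical line; the gap type 'tabel' is an invented description.
sample = "Ma nägin , et ta tuleb <gap type='tabel'> , ja ütlesin : \" Tere ! \""
cleaned = clean_line(sample)
print(cleaned)
print(reattach(cleaned))
```

Keeping the two steps separate lets the tokenized form remain available for counting or searching, while the reattached form is only for display.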
Last modified: October 26 2010 14:53:49.