Reference corpus of Estonian: Transcripts of Riigikogu (Estonian Parliament)

Content

This corpus contains the edited versions of the transcripts of the sessions of the Riigikogu. The originals were downloaded from https://www.riigikogu.ee/ems/plsql/ems.basdata

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.

The corpus is free for use for non-commercial purposes only.

Sources and markup

The texts have been automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Kaarel Kaljurand.

Each file contains the transcripts of one month. There are no corrections or hyphenations in the texts. A place where the rendition of plain text changes is tagged with <hi rend='what kind of rendition'>; the end of the changed rendition is tagged with </hi>.

Every file begins with a header, <teiheader>, that documents the contents of the file, its size, the tags used, etc. (in Estonian). <div0> marks the transcripts of one month, <div1> marks the transcripts of one session, and <div2> marks one item of the agenda.
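The division hierarchy described above can be inspected with a simple scan. The following is a minimal sketch, assuming the opening tags appear literally as <div0>, <div1> and <div2> (possibly with attributes); the function name count_divisions and the sample string are invented for illustration and are not part of the corpus tools.

```python
import re
from collections import Counter

# Matches opening tags <div0 ...>, <div1 ...>, <div2 ...>; closing tags
# start with "</" and are therefore not matched.
DIV_RE = re.compile(r"<div([012])\b", re.IGNORECASE)

def count_divisions(sgml_text):
    """Count months (div0), sessions (div1) and agenda items (div2)."""
    counts = Counter(DIV_RE.findall(sgml_text))
    return {
        "months": counts["0"],
        "sessions": counts["1"],
        "agenda_items": counts["2"],
    }

# Invented sample, not taken from the corpus.
sample = (
    "<div0><div1><div2><p>...</p></div2><div2><p>...</p></div2></div1>"
    "<div1><div2><p>...</p></div2></div1></div0>"
)
print(count_divisions(sample))
```

A regular expression is used instead of an XML parser because the files are SGML and may not be well-formed XML.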

Speaker names are tagged with <rs> and are always also marked with <hi rend='bold'>.
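Speaker names can be pulled out by matching the <rs> element. This is a minimal sketch: the nesting order of <rs> and <hi rend='bold'> is not stated here, so the code strips any inner tags and works with either order. The function name speakers and the sample lines are invented for illustration.

```python
import re

RS_RE = re.compile(r"<rs\b[^>]*>(.*?)</rs>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def speakers(sgml_text):
    """Return the text of every <rs>...</rs> element, inner tags removed."""
    names = []
    for body in RS_RE.findall(sgml_text):
        text = TAG_RE.sub(" ", body)        # drop <hi ...> and </hi>
        names.append(" ".join(text.split()))  # normalise whitespace
    return names

# Invented samples showing both possible nesting orders.
sample_a = "<p><rs><hi rend='bold'>Esimees</hi></rs> <s> Tere ! </s></p>"
sample_b = "<p><hi rend='bold'><rs>A. Tamm</rs></hi> <s> Tere ! </s></p>"
print(speakers(sample_a), speakers(sample_b))
```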

The opening quotation mark is the entity &ldquo; and the closing quotation mark is the entity &rdquo;.

Each paragraph, i.e. one unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except those punctuation marks that are an integral part of the token, e.g. in an abbreviation or an ordinal number) and sentences are tagged with <s> and </s>.
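Because each paragraph sits on one line, sentences are delimited by <s> and </s>, and tokens are separated by spaces, a paragraph line can be split into token lists with very little code. The sketch below assumes that <s> carries no attributes and that any inline tags (such as <hi>) can simply be dropped; the function name sentences and the sample line are invented for illustration.

```python
import re

S_RE = re.compile(r"<s>(.*?)</s>")
TAG_RE = re.compile(r"<[^>]+>")

def sentences(paragraph_line):
    """Return one list of tokens per <s>...</s> unit in a paragraph line."""
    result = []
    for body in S_RE.findall(paragraph_line):
        text = TAG_RE.sub(" ", body)  # drop inline tags such as <hi ...>
        result.append(text.split())   # tokens are already space-separated
    return result

# Invented sample line; "1." stays one token, as for ordinal numbers.
sample = "<p><s> Austatud Riigikogu ! </s><s> Alustame 1. istungit . </s></p>"
for tokens in sentences(sample):
    print(tokens)
```

Entities such as &otilde; are left untouched by this step and remain part of their tokens.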

The corpus contains 13 million words, covering the period from March of 1995 to the end of 2001.

The number of words by year:

1995 - 1.2 million
1996 - 1.8 million
1997 - 1.8 million
1998 - 1.9 million
1999 - 1.8 million
2000 - 2.2 million
2001 - 2.2 million
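As a quick consistency check, the rounded yearly figures can be summed and compared with the stated total of 13 million words. The snippet below only restates the figures listed above; the variable names are chosen for illustration.

```python
# Yearly word counts in millions, as listed above.
words_in_millions = {
    1995: 1.2, 1996: 1.8, 1997: 1.8, 1998: 1.9,
    1999: 1.8, 2000: 2.2, 2001: 2.2,
}

total = sum(words_in_millions.values())
print(f"{total:.1f} million")
```

The yearly figures add up to 12.9 million, which agrees with the rounded total of 13 million.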

In addition to ASCII symbols (unaccented letters, numbers and punctuation marks), the texts contain the following entities:

Aring - Å
Auml - Ä
Ccaron - Č
Egrave - È
Otilde - Õ
Ouml - Ö
Scaron - Š
Uuml - Ü
Zcaron - Ž
aacute - á
agrave - à
amp - &
atilde - ã
auml - ä
ccaron - č
ccedil - ç
deg - °
eacute - é
egrave - è
iacute - í
ldquo - “
lstrok - ł
ntilde - ñ
oacute - ó
oslash - ø
otilde - õ
ouml - ö
rdquo - ”
scaron - š
sect - §
uacute - ú
uuml - ü
zcaron - ž
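The entity list above can be turned into a small lookup table for converting the texts to Unicode. This is a minimal sketch, not the conversion used to build the corpus; the names ENTITIES and decode_entities are invented for illustration, and entities missing from the table are left unchanged.

```python
import re

# Entity names from the list above, mapped to Unicode characters.
ENTITIES = {
    "Aring": "Å", "Auml": "Ä", "Ccaron": "Č", "Egrave": "È",
    "Otilde": "Õ", "Ouml": "Ö", "Scaron": "Š", "Uuml": "Ü",
    "Zcaron": "Ž", "aacute": "á", "agrave": "à", "amp": "&",
    "atilde": "ã", "auml": "ä", "ccaron": "č", "ccedil": "ç",
    "deg": "°", "eacute": "é", "egrave": "è", "iacute": "í",
    "ldquo": "“", "lstrok": "ł", "ntilde": "ñ", "oacute": "ó",
    "oslash": "ø", "otilde": "õ", "ouml": "ö", "rdquo": "”",
    "scaron": "š", "sect": "§", "uacute": "ú", "uuml": "ü",
    "zcaron": "ž",
}

ENTITY_RE = re.compile(r"&([A-Za-z]+);")

def decode_entities(text):
    """Replace known entities in a single pass; keep unknown ones as-is."""
    return ENTITY_RE.sub(
        lambda m: ENTITIES.get(m.group(1), m.group(0)), text
    )

print(decode_entities("&ldquo; k&uuml;simus &rdquo; &sect; 5"))
```

The replacement is done in one pass so that a literal ampersand written as &amp; is not decoded a second time.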

Last modified: December 21 2018 21:22:08.