This corpus contains edited versions of the transcripts of the sessions of the Riigikogu (the Estonian parliament). The originals were downloaded from http://www.riigikogu.ee/ems/plsql/ems.basdata
These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.
The corpus is free for use for non-commercial purposes only.
The texts have been automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Kaarel Kaljurand; they are described in http://psych.ut.ee/~kaarel/corpus_tools/.
One file contains the transcripts of one month. The texts contain no corrections or hyphenation. The place where the rendition of the plain text changes is tagged with <hi rend='what kind of rendition'>; the end of the span is tagged with </hi>.
Every file begins with a header, <teiheader>, that documents the contents of the file, its size, the tags used, etc. (in Estonian). <div0> marks the transcripts of one month, <div1> marks the transcripts of one session, and <div2> marks one agenda item.
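The division hierarchy described above can be checked with a few lines of Python. This is only a minimal sketch: it assumes that tag names may appear in either case and may carry attributes, and it uses a regular expression rather than a real SGML parser. The function name count_divisions and the sample string are made up for illustration.

```python
import re

# Assumption: opening tags look like <div0>, <div1 ...>, <DIV2>, etc.
DIV_OPEN = re.compile(r"<div([0-2])\b[^>]*>", re.IGNORECASE)

def count_divisions(sgml_text):
    """Count opening <div0>, <div1> and <div2> tags in a piece of SGML text."""
    counts = {"div0": 0, "div1": 0, "div2": 0}
    for match in DIV_OPEN.finditer(sgml_text):
        counts["div" + match.group(1)] += 1
    return counts

# Invented sample: one month, two sessions, three agenda items.
sample = (
    "<div0><div1><div2><p>...</p></div2><div2><p>...</p></div2></div1>"
    "<div1><div2><p>...</p></div2></div1></div0>"
)
print(count_divisions(sample))
```

For a real file, the number of <div1> tags would then give the number of sessions in that month, and the number of <div2> tags the number of agenda items.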
The speakers are tagged with <rs> and always appear in bold, marked with <hi rend='bold'>.
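Speaker names can be pulled out of a file with a short Python sketch such as the one below. It is an illustration only: this description does not say whether <hi> sits inside or outside <rs>, so the code accepts both orders. The function name speakers and the sample names are invented.

```python
import re

# Assumption: <rs> elements are not nested inside each other.
RS = re.compile(r"<rs\b[^>]*>(.*?)</rs>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")

def speakers(sgml_text):
    """Return the text of every <rs> element, with inner tags such as <hi> removed."""
    names = []
    for match in RS.finditer(sgml_text):
        plain = ANY_TAG.sub("", match.group(1))
        names.append(" ".join(plain.split()))  # normalise whitespace
    return names

# Invented sample showing both possible nesting orders.
sample = (
    "<p><rs><hi rend='bold'>J. Tamm</hi></rs> <s> Tere ! </s></p>\n"
    "<p><hi rend='bold'><rs>M. Kask</rs></hi> <s> Aitäh ! </s></p>\n"
)
print(speakers(sample))
```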
The opening and closing quotation marks (“ and ”) are each encoded as an entity.
One paragraph, i.e. one unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except those punctuation marks that are an integral part of the token, e.g. in an abbreviation or an ordinal number), and the sentences are tagged with <s> and </s>.
The corpus contains 13 million words and covers the period from March 1995 to the end of 2001.
Number of words by year:
In addition to ASCII symbols (unaccented letters, numbers and punctuation marks), the texts contain the following entities:
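Because each paragraph is on one line and punctuation is already separated by spaces, sentences and tokens can be recovered by simple string handling. The following Python sketch illustrates this under stated assumptions: tags may carry attributes, <s> elements are not nested, and any other inline tags (such as <hi>) are simply dropped. The function name sentences_in_line and the sample sentence are invented for the example.

```python
import re

P = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE)
S = re.compile(r"<s\b[^>]*>(.*?)</s>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")

def sentences_in_line(line):
    """Return a list of sentences, each a list of whitespace-separated tokens."""
    result = []
    for paragraph in P.finditer(line):
        for sentence in S.finditer(paragraph.group(1)):
            # Drop inline tags, then split on the spaces inserted by the tokenizer.
            tokens = ANY_TAG.sub("", sentence.group(1)).split()
            result.append(tokens)
    return result

# Invented sample line; note that the ordinal "3." stays a single token.
line = "<p><s> Tere hommikust , austatud Riigikogu ! </s><s> Alustame 3. istungit . </s></p>"
for tokens in sentences_in_line(line):
    print(tokens)
```

Reading a file line by line and passing each line to sentences_in_line would yield one list of sentences per paragraph. Note that the token counts obtained this way include punctuation marks.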