
The Mixed Corpus: Valgamaalane


This subcorpus contains issues of the newspaper „Valgamaalane“ (the local newspaper of Valga county) from the period 02.09.2004 - 31.07.2008: 598 issues and 10 577 articles, totalling 2 495 302 words in 182 936 sentences.

The texts were downloaded semi-automatically and converted from HTML to TEI format. Kristel Uiboaed wrote the programs and carried out the conversions.

Non-textual material has been omitted from the newspaper texts. By non-textual material we mean pictures (photos, drawings, diagrams etc.). We have also omitted articles consisting only of tables, such as sports results tables or TV programme listings. Lastly, we have omitted weather forecasts and horoscopes.

The corpus is free to use for non-commercial purposes only.

Texts and annotation

Mark-up and annotation conform to the TEI guidelines. Each file contains one issue of the newspaper.

Every file begins with a header, <teiheader>, that contains information such as the file size and the tags used.

The rest of the file is structured as follows:

The text has been annotated for paragraphs, sentences, headlines and authors.
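
As a rough illustration of how these annotations could be inspected, the sketch below counts opening tags in the text of one file. The tag names p, s, head and author are standard TEI element names, but that this corpus uses exactly these names is an assumption, as is the sample string; check the header of an actual file before relying on them. A regular expression is used instead of an XML parser because the files are SGML and may not be well-formed XML.

```python
import re
from collections import Counter

# Assumed tag names (standard TEI elements); not confirmed for this corpus.
TAGS = ("p", "s", "head", "author")

def count_elements(text, tags=TAGS):
    """Count opening tags for each name in `tags`, case-insensitively."""
    counts = Counter({tag: 0 for tag in tags})
    # Match an opening tag such as <s> or <p rend="x">, but not </s> or <span>.
    names = "|".join(map(re.escape, tags))
    pattern = re.compile(r"<(%s)(?=[\s>/])" % names, re.IGNORECASE)
    for match in pattern.finditer(text):
        counts[match.group(1).lower()] += 1
    return dict(counts)

# Hypothetical fragment, made up for illustration only.
sample = (
    "<head>Pealkiri</head>"
    "<author>A. Autor</author>"
    "<p><s>Esimene lause.</s> <s>Teine lause.</s></p>"
    "<p><s>Kolmas lause.</s></p>"
)
print(count_elements(sample))
```

Counting only opening tags keeps the sketch tolerant of SGML files in which some end tags are omitted.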


The SGML files contain the entities listed in this table.

Last modified: December 21 2018 18:32:15.