This subcorpus contains issues of the newspaper „Valgamaalane“ (the local newspaper of Valga county) from the period 02.09.2004 - 31.07.2008: 598 issues and 10 577 articles, totalling 2 495 302 words in 182 936 sentences.
The texts have been semi-automatically downloaded and converted from HTML to TEI format. The programs were written and the conversions done by Kristel Uiboaed.
Non-textual material has been omitted from the newspaper texts. By non-textual material we mean pictures (photos, drawings, diagrams etc.). We have also omitted articles consisting of tables only, such as sports results tables or TV programmes. Lastly, we have omitted weather forecasts and horoscopes.
The corpus is free to use for non-commercial purposes only.
Mark-up and annotation conform to the TEI-guidelines. One file contains one issue of the newspaper.
Every file begins with a header <teiheader> that contains information about the file size, the tags used, etc.
The rest of the file is structured as follows:
<div0> is one issue of the newspaper, e.g. <div0 type='leht'><head> Valgamaalane 15.04.2008 </head>
<div1> is a section, e.g. <div1 type='rubriik'><head>Valgamaa</head>
<div2> is an article, e.g. <div2 type='artikkel'><head> Tervishoid vajab rohkem riigi abi </head>
The text has been annotated for paragraphs, sentences, headlines and authors.
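The nesting of issue, section and article can be walked with a few lines of Python. This is only a sketch on a made-up fragment: the <p> and <s> element names for paragraphs and sentences are assumptions (the description names the annotation levels but not the tags), and `list_articles` is a hypothetical helper, not part of the corpus tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the structure described above.
# The <p> and <s> element names are assumptions; the real files may differ.
SAMPLE = """
<div0 type='leht'><head> Valgamaalane 15.04.2008 </head>
  <div1 type='rubriik'><head>Valgamaa</head>
    <div2 type='artikkel'><head> Tervishoid vajab rohkem riigi abi </head>
      <p><s>Esimene lause.</s> <s>Teine lause.</s></p>
    </div2>
  </div1>
</div0>
"""


def list_articles(xml_text):
    """Return (issue, section, headline, sentence_count) for each article."""
    issue = ET.fromstring(xml_text)
    issue_title = issue.findtext("head", default="").strip()
    rows = []
    for section in issue.iter("div1"):
        section_title = section.findtext("head", default="").strip()
        for article in section.iter("div2"):
            headline = article.findtext("head", default="").strip()
            # Count the (assumed) sentence elements inside this article.
            n_sentences = sum(1 for _ in article.iter("s"))
            rows.append((issue_title, section_title, headline, n_sentences))
    return rows


for row in list_articles(SAMPLE):
    print(row)
```

The real files also contain a header and named character entities, so a strict XML parser would probably reject them unless the entities are declared or resolved first.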
The non-ASCII characters/symbols are presented using the following entities:
| Entity | Symbol | Description |
|---|---|---|
| А | А | Cyrillic capital A |
| Ā | Ā | |
| Å | Å | |
| Ä | Ä | |
| Č | Č | |
| É | É | |
| И | И | |
| Н | Н | Cyrillic capital EN |
| Õ | Õ | |
| Ö | Ö | |
| Š | Š | |
| Ų | Ų | |
| Ü | Ü | |
| В | В | Cyrillic capital VE |
| Ž | Ž | |
| ā | ā | |
| & | & | |
| ą | ą | |
| å | å | |
| * | * | |
| @ | @ | |
| ä | ä | |
| ć | ć | |
| č | č | |
| ° | ° | |
| é | é | |
| ė | ė | |
| ē | ē | |
| ½ | ½ | |
| ¾ | ¾ | |
| &gcedil; | ģ | lowercase ģ |
| > | > | |
| … | … | |
| ī | ī | |
| į | į | |
| ķ | ķ | |
| ļ | ļ | |
| “ | “ | |
| _ | _ | |
| < | < | |
| · | · | |
| ņ | ņ | |
| – | – | |
| ó | ó | |
| õ | õ | |
| ö | ö | |
| % | % | |
| + | + | |
| " | " | |
| ř | ř | |
| ŗ | ŗ | |
| ” | ” | |
| š | š | |
| § | § | |
| ¹ | ¹ | |
| ² | ² | |
| ³ | ³ | |
| × | × | |
| ū | ū | |
| ų | ų | |
| ü | ü | |
| ž | ž | |
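For readers who want plain Unicode text, entity references of this kind can be resolved in a single pass. The sketch below assumes the corpus uses standard named entities, which Python's `html.unescape` already knows, plus a small supplementary map for other names. The map `EXTRA_ENTITIES` and the function `decode_entities` are hypothetical. Only `&gcedil;` is taken from the table, and mapping it to lowercase ģ is an assumption based on its name and description.

```python
import html
import re

# Hypothetical supplement for names that html.unescape may not know.
# Only &gcedil; comes from the table; mapping it to lowercase g-cedilla
# is an assumption based on the entity name.
EXTRA_ENTITIES = {"gcedil": "\u0123"}

_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace named entity references with Unicode characters."""

    def replace_extra(match):
        # Leave names outside the supplement untouched for html.unescape.
        return EXTRA_ENTITIES.get(match.group(1), match.group(0))

    text = _ENTITY_RE.sub(replace_extra, text)
    return html.unescape(text)


print(decode_entities("V&otilde;ru &amp; Valga"))
```

The supplementary map is applied first so that any name missing from the standard set is still resolved. Names found in neither place are left as they are.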