This subcorpus contains issues of the newspaper „Lääne Elu“ (the local newspaper of Läänemaa county) from the period 04.05.2000 – 01.11.2008: 1273 issues and 6407 articles, totalling 1 764 250 words in 126 205 sentences. The texts were downloaded semi-automatically and converted from HTML to TEI format. The programs were written and the conversions carried out by Kristel Uiboaed.
Non-textual material has been omitted from the newspaper texts. By non-textual material we mean pictures (photos, drawings, diagrams, etc.). We have also omitted articles consisting of tables only, such as sports results tables or TV listings, as well as weather forecasts and horoscopes.
The corpus is free to use for non-commercial purposes only.
Mark-up and annotation conform to the TEI guidelines. Each file contains one issue of the newspaper.
Every file begins with a header <teiheader> that contains information about the file size, the tags used, etc.
The rest of the file is structured as follows:
<div0> is one issue of the newspaper, e.g. <div0 type='leht'><head> Lääne Elu 15.04.2008 </head>
<div1> is a section, e.g. <div1 type='rubriik'><head>Uudised</head>
<div2> is an article, e.g. <div2 type='artikkel'><head> Tervishoid vajab rohkem riigi abi </head>
The text has been annotated for paragraphs, sentences, headlines and authors.
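The nesting of issue, section and article can be walked with a short script. The following is a minimal sketch in Python, not part of the corpus tooling: it assumes the body of a file can be read as well-formed XML with closing tags for every `<div0>`, `<div1>` and `<div2>`, and it uses an inline sample built from the example headings above. The function name `list_articles` and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample assembled from the example tags in this description.
# The closing tags are an assumption about how the real files are laid out.
SAMPLE = """<div0 type='leht'><head> Lääne Elu 15.04.2008 </head>
<div1 type='rubriik'><head>Uudised</head>
<div2 type='artikkel'><head> Tervishoid vajab rohkem riigi abi </head>
</div2>
</div1>
</div0>"""


def list_articles(xml_text):
    """Return (issue, section, headline) tuples for every article."""
    root = ET.fromstring(xml_text)  # root element is <div0>, one issue
    issue = root.findtext("head", default="").strip()
    rows = []
    for section in root.iter("div1"):  # each <div1> is a section
        section_name = section.findtext("head", default="").strip()
        for article in section.iter("div2"):  # each <div2> is an article
            headline = article.findtext("head", default="").strip()
            rows.append((issue, section_name, headline))
    return rows


articles = list_articles(SAMPLE)
for issue, section_name, headline in articles:
    print(issue, "|", section_name, "|", headline)
```

If the real files contain entity references that are not declared in the file itself, a strict XML parser will reject them, so the entities would need to be resolved or declared before parsing.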
Non-ASCII characters and some special symbols are represented using the following entities:
| Entity | Character represented |
|---|---|
| À | À |
| & | & |
| ą | ą |
| Å | Å |
| å | å |
| * | * |
| @ | @ |
| ã | ã |
| Ä | Ä |
| ä | ä |
| ć | ć |
| ° | ° |
| É | É |
| é | é |
| ë | ë |
| … | … |
| į | į |
| “ | “ |
| _ | _ |
| [ | [ |
| · | · |
| – | – |
| Ñ | Ñ |
| º | º |
| Ø | Ø |
| ø | ø |
| Õ | Õ |
| õ | õ |
| Ö | Ö |
| ö | ö |
| % | % |
| + | + |
| ± | ± |
| " | " |
| ” | ” |
| ] | ] |
| ś | ś |
| Š | Š |
| š | š |
| § | § |
| ² | ² |
| ³ | ³ |
| Ž | Ž |
| ž | ž |
| ˜ | ~ |
| × | × |
| Ų | Ų |
| Ü | Ü |
| ü | ü |
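When reading the corpus text, entity references usually have to be turned back into the characters they stand for. The sketch below is only an illustration in Python and rests on an assumption: that the entities use the standard HTML named forms (for example `&auml;`, `&otilde;`, `&Scaron;`). If the corpus uses different names, a custom mapping would be needed instead. The helper name `decode_entities` is hypothetical.

```python
import html


def decode_entities(text):
    """Replace standard named entity references with their characters.

    Assumption: the corpus entities match the HTML named entities
    understood by html.unescape; unknown names are left untouched.
    """
    return html.unescape(text)


decoded = decode_entities("L&auml;&auml;ne Elu, &otilde;petaja, &Scaron;veits")
print(decoded)
```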