This corpus contains the issues of the evening paper „Õhtuleht“ (also called „SL Õhtuleht“ during some periods) from 06.03.1997 to 31.12.2007: 3344 issues in total, containing 45 572 699 words.
These texts form a part of the reference corpus of Estonian. Their collection and processing were financed by the National Program for Estonian Language Technology.
The corpus is free for use for non-commercial purposes only.
The texts originate from http://www.ohtuleht.ee/arhiiv/
The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Krista Liin.
Non-textual material, i.e. pictures (e.g. photos, caricatures), has been omitted. The omitted material also includes TV programmes, hyperlinks, tables (e.g. sports results, currency rates) and advertisements. The omitted material, except pictures, has been replaced by the tag <gap desc='description_of_the_omitted_material'>.
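To get an overview of what was left out of an issue, the <gap> tags can be counted by description. The following is a minimal sketch, not part of the corpus tools: it assumes that the desc attribute is quoted with straight single or double quotes, and the description values 'tabel' and 'reklaam' in the sample are hypothetical.

```python
import re
from collections import Counter

# Matches <gap desc='...'> or <gap desc="...">.
# Assumption: the attribute value is quoted with straight quotes.
GAP_RE = re.compile(r"""<gap\s+desc=(['"])(.*?)\1\s*/?>""", re.IGNORECASE | re.DOTALL)


def count_gaps(sgml_text):
    """Return a Counter mapping each gap description to its frequency."""
    return Counter(m.group(2) for m in GAP_RE.finditer(sgml_text))


# Hypothetical fragment; the real description strings may differ.
sample = (
    "<p> <s> Tulemused: </s> <gap desc='tabel'> </p> "
    "<gap desc='reklaam'> <gap desc='tabel'>"
)

gap_counts = count_gaps(sample)
for description, count in gap_counts.most_common():
    print(description, count)
```

Pictures do not appear in the counts, since they were omitted without a <gap> tag.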
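One file contains one newspaper issue. The subparts of the issue (sections, sub-sections and articles) have been tagged with the type values 'leht' (issue), 'rubriik' (section), 'alamrubriik' (sub-section) and 'artikkel' (article), e.g.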
<div0 type='leht'> <head> SL Öhtuleht 2004.10.19 </head>
<div1 type='rubriik'> <head> Uudised </head>
<div2 type='alamrubriik'> <head> Lühiuudised </head>
<div3 type='artikkel'> <head> Rahvakohtunik vahendas altkäemaksu </head>
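The hierarchy above can be read into an outline of headings. The sketch below uses Python's standard html.parser module, which tolerates SGML-style markup; it is an illustration only, and it assumes that each <head> directly follows the <div0> to <div3> tag it belongs to, as in the example. The class name HeadingOutline is hypothetical.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, type, heading) for each <div0>..<div3> that has a <head>."""

    DIV_TAGS = ("div0", "div1", "div2", "div3")

    def __init__(self):
        super().__init__()
        self.outline = []      # list of (level, type attribute, heading text)
        self._pending = None   # most recent div still waiting for its <head>
        self._in_head = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.DIV_TAGS:
            # The digit in the tag name gives the nesting level.
            self._pending = (int(tag[3]), dict(attrs).get("type"))
        elif tag == "head" and self._pending is not None:
            self._in_head = True
            self._text = []

    def handle_data(self, data):
        if self._in_head:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "head" and self._in_head:
            level, div_type = self._pending
            self.outline.append((level, div_type, "".join(self._text).strip()))
            self._in_head = False
            self._pending = None


# The example lines from this description, used as input.
sample = """<div0 type='leht'> <head> SL Öhtuleht 2004.10.19 </head>
<div1 type='rubriik'> <head> Uudised </head>
<div2 type='alamrubriik'> <head> Lühiuudised </head>
<div3 type='artikkel'> <head> Rahvakohtunik vahendas altkäemaksu </head>"""

parser = HeadingOutline()
parser.feed(sample)
parser.close()
for level, div_type, heading in parser.outline:
    print("  " * level + f"[{div_type}] {heading}")
```

A division without a heading is simply skipped by this sketch, so articles that have no <head> do not appear in the outline.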
The division of the texts into paragraphs follows the original HTML files exactly; paragraphs have been tagged using the tag <p>. The sentences have been tagged automatically using the tag <s>. The headings and authors have been tagged; not every article has a heading or an author. The author has been tagged using <bibl> <author>; text characterising the author (e.g. „Editor”) is also enclosed inside these tags.
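The paragraph and sentence mark-up can be turned into a list of sentences per paragraph. The following is a rough sketch under stated assumptions: it presumes that every <p> and <s> has an explicit closing tag (SGML allows these to be omitted, so this should be checked against the actual files), and the sample text is invented for illustration.

```python
import re

P_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL)   # <p> or <p rend='...'>
S_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")                          # any remaining inline tag


def paragraphs_as_sentences(sgml_text):
    """Return one list of plain-text sentences for each <p> element."""
    result = []
    for paragraph in P_RE.finditer(sgml_text):
        sentences = []
        for raw in S_RE.findall(paragraph.group(1)):
            # Drop inline tags such as <hi> and normalise whitespace.
            sentences.append(" ".join(TAG_RE.sub(" ", raw).split()))
        result.append(sentences)
    return result


# Hypothetical fragment, not taken from the corpus.
sample = (
    "<p> <s> Esimene lause. </s> <s> Teine <hi rend='rasvane'> lause </hi> . </s> </p> "
    "<p rend='kaldkiri'> <s> Kolmas lause. </s> </p>"
)

sentences_by_paragraph = paragraphs_as_sentences(sample)
print(sentences_by_paragraph)
```

Regular expressions are used here only because the fragment is small and flat; for whole files a tolerant SGML parser is likely to be more robust.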
The mark-up follows the TEI guidelines.
The texts have not been corrected. They contain no hyphenation.
The rendition information has been tagged using the attribute 'rend', with the values 'rasvane' (bold) and 'kaldkiri' (italic). If the rendition concerns a whole paragraph, the attribute 'rend' is used with the corresponding tag <p>. The possible tags and values for rendition are the following:
<hi rend='rasvane'>
<hi rend='kaldkiri'>
<p rend='rasvane'>
<p rend='kaldkiri'>
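For display purposes the <hi> rendition tags can be mapped to HTML equivalents. This is only a sketch: the mapping of 'rasvane' to <b> and 'kaldkiri' to <i> is a choice made here, the function name hi_to_html is hypothetical, and nested <hi> elements are not handled.

```python
import re

# Chosen mapping: 'rasvane' (bold) -> <b>, 'kaldkiri' (italic) -> <i>.
REND_TO_HTML = {"rasvane": "b", "kaldkiri": "i"}

HI_RE = re.compile(r"<hi\s+rend=(['\"])(\w+)\1\s*>(.*?)</hi>", re.DOTALL)


def hi_to_html(sgml_text):
    """Replace <hi rend='...'> spans with <b> or <i>; leave unknown values as-is."""

    def replace(match):
        html_tag = REND_TO_HTML.get(match.group(2))
        if html_tag is None:
            return match.group(0)
        return f"<{html_tag}>{match.group(3)}</{html_tag}>"

    return HI_RE.sub(replace, sgml_text)


# Hypothetical sentence, not taken from the corpus.
sample = (
    "<s> See on <hi rend='rasvane'> oluline </hi> ja "
    "<hi rend='kaldkiri'> kiire </hi> uudis. </s>"
)
converted = hi_to_html(sample)
print(converted)
```

Paragraph-level rendition (<p rend='...'>) is left untouched by this sketch and would need a separate rule.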
Every file starts with a <teiHeader> documenting the file contents, size, the tags used, etc.
In addition to ASCII symbols (unaccented letters, numbers and punctuation signs), the texts contain entities as documented in this table.
The entity ? is used for an unknown character.