
Reference Corpus of Estonian: (SL) Õhtuleht

Contents and size

This corpus contains the issues of the evening paper „Õhtuleht“ (called „SL Õhtuleht“ during some periods) from 6 March 1997 to 31 December 2007: 3344 issues in total, 45 572 699 words.

These texts form a part of the Reference Corpus of Estonian. Their collection and processing were financed by the National Program for Estonian Language Technology.

How can one use it?

The corpus is free to use for non-commercial purposes only.

Contents

The texts originate from http://www.ohtuleht.ee/arhiiv/

The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Krista Liin.

Non-textual material has been omitted. This covers pictures (e.g. photos, caricatures), TV programmes, hyperlinks, tables (e.g. sports results, currency rates) and advertisements. All omitted material except pictures has been replaced by the tag <gap desc='description_of_the_omitted_material'>.
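As an illustration, the omission markers can be tallied by their description. The following is a minimal sketch, not part of the corpus tooling: the function name `count_gaps` and the sample `desc` values ('tabel', 'reklaam') are hypothetical, and only the tag name `gap` and the attribute `desc` are taken from the description above.

```python
import re
from collections import Counter

# Matches <gap desc='...'> or <gap desc="...">. Only the tag and attribute
# names come from the corpus description; the quoting variants are assumed.
GAP_RE = re.compile(r"<gap\s+desc=(['\"])(.*?)\1\s*/?>", re.IGNORECASE | re.DOTALL)


def count_gaps(sgml_text: str) -> Counter:
    """Count omitted-material markers, grouped by their description."""
    return Counter(m.group(2).strip() for m in GAP_RE.finditer(sgml_text))


# Hypothetical sample; real desc values may differ.
sample = "<p> <s> Tulemused: </s> </p> <gap desc='tabel'> <gap desc='reklaam'> <gap desc='tabel'>"
for description, count in count_gaps(sample).most_common():
    print(description, count)
```

Such a tally gives a rough idea of how much tabular or advertising material an issue contained before conversion.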

One file contains one newspaper issue. The subparts of an issue (sections, sub-sections and articles) have been tagged, e.g.

    <div0 type='leht'> <head> SL &Ouml;htuleht 2004.10.19 </head>
    <div1 type='rubriik'> <head> Uudised </head>
    <div2 type='alamrubriik'> <head> L&uuml;hiuudised </head>
    <div3 type='artikkel'> <head> Rahvakohtunik vahendas altk&auml;emaksu </head>
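The division hierarchy shown above can be turned into a simple outline of an issue. This is a hedged sketch that assumes every division opens with a <head> and that attribute values are single-quoted as in the example; the function name `list_headings` is hypothetical. A regular expression is used instead of an XML parser because the files are SGML and may not be well-formed XML.

```python
import re

# <divN type='...'> immediately followed by <head> ... </head>.
DIV_HEAD_RE = re.compile(
    r"<div(\d)\s+type='([^']*)'>\s*<head>(.*?)</head>", re.DOTALL
)


def list_headings(sgml_text):
    """Return (level, type, heading) for each division that opens with a <head>."""
    return [
        (int(level), div_type, head.strip())
        for level, div_type, head in DIV_HEAD_RE.findall(sgml_text)
    ]


sample = """<div0 type='leht'> <head> SL &Ouml;htuleht 2004.10.19 </head>
<div1 type='rubriik'> <head> Uudised </head>
<div2 type='alamrubriik'> <head> L&uuml;hiuudised </head>
<div3 type='artikkel'> <head> Rahvakohtunik vahendas altk&auml;emaksu </head>"""

# Print an indented outline: issue, section, sub-section, article.
for level, div_type, head in list_headings(sample):
    print("  " * level + f"{div_type}: {head}")
```

Headings are returned with their entities undecoded; decoding is a separate step.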

The division of the texts into paragraphs follows the original HTML files exactly; paragraphs are tagged with <p>. Sentences have been tagged automatically with <s>. Headings and authors have been tagged, but not every article has a heading or an author. The author is tagged with <bibl> <author>; text characterising the author (e.g. „Editor“) is also enclosed inside these tags.
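To show how the paragraph and sentence tags might be consumed, here is a minimal sketch that groups sentences by paragraph. It assumes that both <p> and <s> have explicit closing tags, which SGML does not strictly require, so real files should be checked first. The function name `sentences_by_paragraph` and the sample text are hypothetical.

```python
import re

P_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL)
S_RE = re.compile(r"<s\b[^>]*>(.*?)</s>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def sentences_by_paragraph(sgml_text):
    """Return a list of paragraphs, each a list of plain-text sentences."""
    paragraphs = []
    for p_body in P_RE.findall(sgml_text):
        # Drop inline tags such as <hi> and normalise whitespace.
        sentences = [
            " ".join(TAG_RE.sub(" ", s_body).split())
            for s_body in S_RE.findall(p_body)
        ]
        paragraphs.append(sentences)
    return paragraphs


# Hypothetical sample in the style of the corpus mark-up.
sample = (
    "<p> <s> Esimene lause. </s> "
    "<s> See on <hi rend='rasvane'> teine </hi> lause. </s> </p> "
    "<p rend='kaldkiri'> <s> Kolmas lause. </s> </p>"
)
print(sentences_by_paragraph(sample))
```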

The mark-up follows the TEI guidelines.

The texts have not been corrected. They contain no hyphenation.

Rendition information has been tagged using the attribute 'rend'. If the rendition applies to a whole paragraph, the attribute is placed on the corresponding <p> tag. The possible tags and values ('rasvane' means bold, 'kaldkiri' means italic) are the following:

    <hi rend='rasvane'>
    <hi rend='kaldkiri'>
    <p rend='rasvane'>
    <p rend='kaldkiri'>
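The rendition values listed above could be read out as follows. This is only a sketch: it handles inline <hi> spans, assumes they are closed with </hi> and do not nest, and uses a hypothetical helper name `highlighted_spans` and sample text.

```python
import re

# English glosses for the two rendition values used in the corpus.
REND_LABELS = {"rasvane": "bold", "kaldkiri": "italic"}

HI_RE = re.compile(r"<hi\s+rend='([^']*)'>(.*?)</hi>", re.DOTALL)


def highlighted_spans(sgml_text):
    """Return (rendition, text) pairs for every inline <hi> span."""
    return [
        (REND_LABELS.get(rend, rend), " ".join(text.split()))
        for rend, text in HI_RE.findall(sgml_text)
    ]


# Hypothetical sample.
sample = "<p> <hi rend='rasvane'> Tallinn </hi> ja <hi rend='kaldkiri'> Tartu </hi> </p>"
print(highlighted_spans(sample))
```

Unknown rendition values are passed through unchanged rather than dropped, so unexpected mark-up remains visible.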

Every file starts with a <teiHeader> documenting the file contents, size, the tags used etc.

Entities

In addition to ASCII symbols (unaccented letters, digits and punctuation marks), the texts contain entities as documented in this table.

The entity &quest; is used for an unknown character.
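Entities can be converted to Unicode characters when reading the texts. The sketch below assumes that the corpus entity names coincide with the HTML named character references known to Python's `html.unescape`; this holds for the entities in the examples on this page (&uuml;, &auml;, &Ouml;) but has not been checked against the full entity table. The function name `decode_entities` is hypothetical, and mapping &quest; to U+FFFD (the Unicode replacement character) is a design choice made here, not something the corpus prescribes.

```python
import html

# Marker for characters the corpus could not identify (&quest;).
UNKNOWN_CHAR = "\ufffd"


def decode_entities(text: str) -> str:
    """Replace SGML entities with Unicode characters.

    &quest; is handled first so that unknown characters stay
    distinguishable from real question marks in the text.
    """
    text = text.replace("&quest;", UNKNOWN_CHAR)
    return html.unescape(text)


print(decode_entities("L&uuml;hiuudised"))
print(decode_entities("Rahvakohtunik vahendas altk&auml;emaksu"))
```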


Last modified: February 01 2010 01:08:19.