This subcorpus contains texts from the newsmagazine Luup: about 1.9 million words in 130 issues and 2298 articles. The corpus covers issues from the years 1996-2002.
The texts originate from the webpage http://luup.postimees.ee/
Texts have been semi-automatically downloaded from the web and converted from HTML-format to TEI-format.
The corpus is free to use for non-commercial purposes only.
Mark-up and annotation conform to the TEI-guidelines. One file contains one issue of the journal.
Every file begins with a header <teiheader> that contains information about the file, such as its size and the tags used.
The rest of the file is structured as follows:
<div0> is one issue of the journal, e.g. <div0 type='ajakirjanumber'><head>Luup Nr. 13 (122), 22. juuli 2000</head>
<div1> is a section, e.g. <div1 type='rubriik'><head>JUHTKIRI</head>
<div2> is an article, e.g. <div2 type='artikkel'><head>Kõige tähtsam raamat</head>
<div3> is a subpart of an article, e.g. <div3 type='alaosa'><head>Kui jäämäed hakkavad sulama</head> (in the issues 1998 Nr. 14 – 2002 Nr. 04 the mark-up of <div3> may be erroneous).
The text has been annotated for paragraphs <p>, sentences <s>, headlines <head> and authors <bibl><author>.
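The nesting described above can be walked with a standard XML parser. The following is only a minimal sketch: it assumes the fragment is well-formed XML with entities already resolved, which may not hold for the actual corpus files, and the sample string and the function name list_articles are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment assembled from the examples in this description;
# the sentence texts are placeholders, not corpus content.
SAMPLE = """<div0 type='ajakirjanumber'><head>Luup Nr. 13 (122), 22. juuli 2000</head>
<div1 type='rubriik'><head>JUHTKIRI</head>
<div2 type='artikkel'><head>Kõige tähtsam raamat</head>
<p><s>First sentence.</s> <s>Second sentence.</s></p>
<div3 type='alaosa'><head>Kui jäämäed hakkavad sulama</head>
<p><s>Third sentence.</s></p>
</div3>
</div2>
</div1>
</div0>"""


def list_articles(xml_text):
    """Return one record per <div2> article: section, title, sentence count."""
    root = ET.fromstring(xml_text)  # root is the <div0> issue element
    articles = []
    for section in root.iter("div1"):
        section_title = section.findtext("head", default="")
        for article in section.iter("div2"):
            articles.append({
                "section": section_title,
                # findtext looks only at direct children, so this is the
                # article headline, not a <div3> subpart headline.
                "title": article.findtext("head", default=""),
                # iter() descends into <div3>, so subpart sentences count too.
                "sentences": sum(1 for _ in article.iter("s")),
            })
    return articles


articles = list_articles(SAMPLE)
```

Because the <div3> mark-up may be erroneous in some issues, the sketch counts sentences per article rather than relying on subparts.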
Non-textual material has been omitted from the text and replaced by a tag <gap desc='description_of_the_omitted_material'>. Non-textual material means pictures (photos, drawings, diagrams etc.), tables and the like.
In the corpus version accessible through our corpus query, all mark-up has been deleted except the <gap> tags that mark omitted material.
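As an illustration, the <gap> tags can be listed or stripped with a regular expression. This is a sketch under assumptions: it treats the tag as a single unclosed element with one quoted desc attribute, and the description value 'foto' in the sample is hypothetical.

```python
import re

# Matches <gap desc='...'> with straight or typographic quotes around the
# value; the exact attribute layout in the files is an assumption.
GAP_RE = re.compile(r"<gap\s+desc=(['\"’])(.*?)\1\s*/?>")


def gap_descriptions(text):
    """Return the desc value of every <gap> tag, in document order."""
    return [m.group(2) for m in GAP_RE.finditer(text)]


def strip_gaps(text):
    """Remove all <gap> tags, leaving the surrounding text untouched."""
    return GAP_RE.sub("", text)


sample = "<p><s>Text.</s><gap desc='foto'></p>"  # hypothetical line
found = gap_descriptions(sample)
cleaned = strip_gaps(sample)
```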
The non-ASCII characters/symbols are presented using the following entities:
| Symbol | Entity |
|---|---|
| â | acirc |
| à | agrave |
| À | Agrave |
| & | amp |
| Å | Aring |
| å | aring |
| ä | auml |
| Ä | Auml |
| • | bull |
| ć | cacute |
| Ć | Cacute |
| ° | deg |
| é | eacute |
| É | Eacute |
| è | egrave |
| ë | euml |
| ¼ | frac14 |
| > | gt |
| í | iacute |
| « | laquo |
| “ | ldquo |
| < | lt |
| µ | micro |
| · | middot |
| ń | nacute |
| Ó | Oacute |
| ó | oacute |
| ô | ocirc |
| Ø | Oslash |
| ø | oslash |
| õ | otilde |
| Õ | Otilde |
| ö | ouml |
| Ö | Ouml |
| ‰ | permil |
| ± | plusmn |
| » | raquo |
| ” | rdquo |
| š | scaron |
| § | sect |
| ¹ | sup1 |
| ² | sup2 |
| ³ | sup3 |
| ß | szlig |
| ú | uacute |
| Ü | Uuml |
| ü | uuml |
| ž | zcaron |
| Ž | Zcaron |