The corpus is free for use for non-commercial purposes only.
This corpus contains the issues of the weekly newspaper Maaleht from issue 20 of 2001 to issue 20 of 2004, approximately 4.3 million words in total. The distribution of words by file is given in a table.
These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program "Estonian language and national culture".
The texts originate from www.maaleht.ee.
The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written, conversions made and the frequencies calculated by Øivind Rangøy.
One file contains one issue. Non-textual parts (e.g. pictures, comic strips) have been omitted. TV programmes, all advertisements, etc. have also been omitted. Multiple occurrences of the same article have been deleted.
The opening quotation mark is the entity &ldquo; (“) and the closing quotation mark is the entity &rdquo; (”); the single quote is '.
The rendition information has been tagged, using the attribute rend. The possible tags and values for rendition are the following: <hi rend='bold'>, <hi rend='italic'>, <hi rend='sup'>, <hi rend='underline'>, <hi>, <p rend='bold'>, <p rend='bold_italic'>, <p rend='bold_underline'>. <div0> stands for a whole issue, <div1> stands for a theme (e.g. "Uudised"), <div2> stands for an article.
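
As a rough illustration of how the rendition markup could be inspected, the following sketch counts the `rend` values found on `<hi>` and `<p>` tags. The sample string is made up for the example and is not taken from the corpus; the function name `count_renditions` is hypothetical, and the pattern assumes lowercase tag names and quoted attribute values as shown in the list above.

```python
import re
from collections import Counter

# Matches <hi rend='...'> and <p rend='...'> with single or double quotes.
# Assumption: rend is the only attribute on these tags.
REND_RE = re.compile(r"<(hi|p)\s+rend=['\"]([^'\"]+)['\"]\s*>")

def count_renditions(sgml):
    """Count (tag, rend value) pairs, e.g. ('hi', 'bold')."""
    return Counter(REND_RE.findall(sgml))

# Invented fragment, for illustration only.
sample = (
    "<p rend='bold'>Pealkiri</p>"
    "<p>Tekst <hi rend='italic'>kaldkirjas</hi> ja <hi rend='italic'>veel</hi>.</p>"
)

for (tag, rend), n in count_renditions(sample).items():
    print(tag, rend, n)
```

A bare `<hi>` without a `rend` attribute is not counted by this pattern.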
The text has been divided into paragraphs according to the original HTML file; the sentences have been tagged automatically. The titles and authors have been annotated using the tags <bibl><author><s>; an article can lack a title or an author.
Every file begins with a <teiHeader> documenting the file's contents, size, the tags used, etc.
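
The division into articles and sentences described above could be read back with a few regular expressions. This is a minimal sketch, not the tooling used to build the corpus: the fragment in `SAMPLE` is invented from the tag descriptions, the function name `articles` is hypothetical, and the patterns assume that every `<div2>` and `<s>` has an explicit closing tag and that tag names are lowercase.

```python
import re

# Invented fragment built from the tag descriptions; not taken from the corpus.
SAMPLE = """
<div0>
<div1>
<div2>
<p><s>Esimene lause.</s> <s>Teine <hi rend='bold'>lause</hi>.</s></p>
</div2>
<div2>
<p rend='bold'><s>Kolmas lause.</s></p>
</div2>
</div1>
</div0>
"""

ARTICLE_RE = re.compile(r"<div2\b[^>]*>(.*?)</div2>", re.DOTALL)
SENTENCE_RE = re.compile(r"<s\b[^>]*>(.*?)</s>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def articles(sgml):
    """Return one list of plain-text sentences per <div2> article."""
    result = []
    for body in ARTICLE_RE.findall(sgml):
        # Strip inline markup such as <hi> from each sentence.
        sentences = [TAG_RE.sub("", s).strip() for s in SENTENCE_RE.findall(body)]
        result.append(sentences)
    return result

for number, sentences in enumerate(articles(SAMPLE), start=1):
    print(number, sentences)
```

Because titles and authors are also wrapped in `<s>`, this sketch would return them alongside the body sentences; a real reader would probably want to handle `<bibl>` separately.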
The following entities have been used in this corpus:
| Entity | Sign | Explanation |
|---|---|---|
| Aacute | Á | capital A, acute accent |
| Aring | Å | capital A, ring |
| Auml | Ä | capital A, dieresis or umlaut mark |
| Ccaron | Č | capital C, caron |
| Ccirc | Ĉ | capital C, circumflex accent |
| Eacute | É | capital E, acute accent |
| Ncedil | Ņ | capital N, cedilla |
| Omacr | Ō | capital O, macron |
| Oslash | Ø | capital O, slash |
| Otilde | Õ | capital O, tilde |
| Ouml | Ö | capital O, dieresis or umlaut mark |
| Scaron | Š | capital S, caron |
| Umacr | Ū | capital U, macron |
| Uuml | Ü | capital U, dieresis or umlaut mark |
| Zcaron | Ž | capital Z, caron |
| aacute | á | small a, acute accent |
| acirc | â | small a, circumflex accent |
| aelig | æ | small ae diphthong (ligature) |
| agrave | à | small a, grave accent |
| amacr | ā | small a, macron |
| amp | & | ampersand |
| aring | å | small a, ring |
| atilde | ã | small a, tilde |
| auml | ä | small a, dieresis or umlaut mark |
| bull | • | bullet |
| cacute | ć | small c, acute accent |
| ccaron | č | small c, caron |
| ccedil | ç | small c, cedilla |
| curren | ¤ | general currency sign |
| dagger | † | dagger |
| deg | ° | degree sign |
| eacute | é | small e, acute accent |
| egrave | è | small e, grave accent |
| emacr | ē | small e, macron |
| eogon | ę | small e, ogonek |
| euml | ë | small e, dieresis or umlaut mark |
| euro | € | euro sign |
| frac12 | ½ | fraction one-half |
| frac14 | ¼ | fraction one-quarter |
| frac34 | ¾ | fraction three-quarters |
| gt | > | greater-than sign |
| iacute | í | small i, acute accent |
| imacr | ī | small i, macron |
| kcedil | ķ | small k, cedilla |
| lcedil | ļ | small l, cedilla |
| ldquo | “ | left double quotation mark |
| lt | < | less-than sign |
| micro | µ | micro sign |
| middot | · | middle dot |
| nacute | ń | small n, acute accent |
| ncaron | ň | small n, caron |
| ncedil | ņ | small n, cedilla |
| ntilde | ñ | small n, tilde |
| oacute | ó | small o, acute accent |
| ograve | ò | small o, grave accent |
| ohm | Ω | ohm sign |
| omacr | ō | small o, macron |
| oslash | ø | small o, slash |
| otilde | õ | small o, tilde |
| ouml | ö | small o, dieresis or umlaut mark |
| permil | ‰ | per mille sign |
| plusmn | ± | plus-or-minus sign |
| pound | £ | pound sign |
| rarr | → | rightward arrow |
| rcaron | ř | small r, caron |
| rcedil | ŗ | small r, cedilla |
| rdquo | ” | right double quotation mark |
| reg | ® | registered sign |
| sacute | ś | small s, acute accent |
| scaron | š | small s, caron |
| sect | § | section sign |
| sup1 | ¹ | superscript one |
| sup2 | ² | superscript two |
| sup3 | ³ | superscript three |
| szlig | ß | small sharp s, German (sz ligature) |
| times | × | multiply sign |
| trade | ™ | trade mark sign |
| uacute | ú | small u, acute accent |
| ucirc | û | small u, circumflex accent |
| ugrave | ù | small u, grave accent |
| umacr | ū | small u, macron |
| uuml | ü | small u, dieresis or umlaut mark |
| yacute | ý | small y, acute accent |
| zcaron | ž | small z, caron |
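
To turn the entities back into characters, a lookup over the table above is probably enough. The sketch below is only an illustration: it lists a small subset of the rows, the names `ENTITIES` and `decode_entities` are hypothetical, the sample sentence is invented, and it assumes that every entity reference is written in the full `&name;` form with a terminating semicolon.

```python
import re

# Subset of the entity table; a full decoder would need one entry per row.
ENTITIES = {
    "auml": "ä",
    "otilde": "õ",
    "ouml": "ö",
    "uuml": "ü",
    "scaron": "š",
    "zcaron": "ž",
    "ldquo": "\u201c",  # left double quotation mark
    "rdquo": "\u201d",  # right double quotation mark
    "amp": "&",
}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def decode_entities(text):
    """Replace known &name; references; leave unknown ones unchanged."""
    def repl(match):
        return ENTITIES.get(match.group(1), match.group(0))
    # A single pass, so text produced by &amp; is not decoded a second time.
    return ENTITY_RE.sub(repl, text)

# Invented sentence, for illustration only.
sample = "&ldquo;T&otilde;de on k&auml;es&rdquo;, &uuml;tles ta."
print(decode_entities(sample))
```

Leaving unknown references untouched makes missing table rows easy to spot in the output instead of silently dropping them.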