Reference corpus: weekly Maaleht

Content

This corpus contains the issues of the weekly newspaper Maaleht from issue 20 of 2001 through issue 20 of 2004, approximately 4.3 million words in total. The distribution of tokens by year is:

Year	Tokens
2001	850,176
2002	1,369,809
2003	1,477,490
2004	577,756
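
As a quick consistency check, the yearly counts can be summed and compared with the stated size of roughly 4.3 million. The sketch below is only an illustration: the numbers are copied from the table above, and the names TOKENS_BY_YEAR, total_tokens and share_by_year are made up for this example.

```python
# Token counts per year, copied from the table above.
TOKENS_BY_YEAR = {
    2001: 850_176,
    2002: 1_369_809,
    2003: 1_477_490,
    2004: 577_756,
}


def total_tokens(counts):
    """Return the sum of all yearly token counts."""
    return sum(counts.values())


def share_by_year(counts):
    """Return each year's share of the total as a percentage (one decimal)."""
    total = total_tokens(counts)
    return {year: round(100 * n / total, 1) for year, n in counts.items()}


if __name__ == "__main__":
    print(total_tokens(TOKENS_BY_YEAR))  # 4275231
    for year, pct in share_by_year(TOKENS_BY_YEAR).items():
        print(year, f"{pct}%")
```

The total of 4,275,231 tokens agrees with the rounded figure given above. The years 2001 and 2004 are smaller because each covers only part of a year.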

The corpus may be used free of charge for non-commercial purposes only.

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national programme "Estonian Language and National Culture".

Sources

The texts originate from www.maaleht.ee.

The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written, conversions made and the frequencies calculated by Øivind Rangøy.

One file contains one issue. Non-textual parts (e.g. pictures, comic strips) have been omitted. TV programmes, all advertisements, etc. have also been omitted. Multiple occurrences of the same article have been deleted.
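
This page does not say how repeated articles were detected. Purely as an illustration of one possible approach, the sketch below drops later copies of an article whose text is identical after whitespace and case normalisation. The function names fingerprint and drop_duplicates are hypothetical and are not part of the corpus tools.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(articles):
    """Keep the first occurrence of each article, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        key = fingerprint(article)
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

An exact-match fingerprint like this misses articles that were republished with small edits; catching those would need a fuzzier comparison.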

Markup

The opening quotation mark (“) and the closing quotation mark (”) are each encoded as an SGML entity; the single quote is '.
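
When reading the files, the quotation-mark entities usually need to be turned back into characters. The sketch below assumes that the entities use the standard ISO names ldquo, rdquo and apos; this page does not spell the names out, so check them against the files before relying on the mapping. The names QUOTE_ENTITIES and decode_quotes are hypothetical.

```python
# Assumed entity names (standard ISO names); verify against the corpus files.
QUOTE_ENTITIES = {
    "&ldquo;": "\u201c",  # opening double quotation mark
    "&rdquo;": "\u201d",  # closing double quotation mark
    "&apos;": "'",        # single quote
}


def decode_quotes(text):
    """Replace the quotation-mark entities; leave all other entities untouched."""
    for entity, char in QUOTE_ENTITIES.items():
        text = text.replace(entity, char)
    return text
```

Other entities are deliberately left alone here, so that they can be handled separately using the corpus's own entity table.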

The rendition information has been tagged using the attribute rend. The possible tags and values for rendition are the following: <hi rend='bold'>, <hi rend='italic'>, <hi rend='sup'>, <hi rend='underline'>, <hi>, <p rend='bold'>, <p rend='bold_italic'>, <p rend='bold_underline'>.

<div0> stands for a whole issue, <div1> stands for a theme (e.g. "Uudised"), <div2> stands for an article.
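
The division and rendition tags can be counted with a tolerant tag scanner. The sketch below uses Python's standard html.parser module, which accepts SGML-like markup without requiring well-formed XML. The SAMPLE fragment is invented for illustration and real files will differ; MarkupCounter and count_markup are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Invented fragment that only imitates the tag set described above.
SAMPLE = """
<div0>
<div1>
<div2><p rend='bold'>Pealkiri</p><p>Tekst <hi rend='italic'>kaldkirjas</hi> ja <hi>esile</hi>.</p></div2>
<div2><p>Teine <hi rend='bold'>artikkel</hi>.</p></div2>
</div1>
</div0>
"""


class MarkupCounter(HTMLParser):
    """Count division tags and (tag, rend) combinations."""

    def __init__(self):
        super().__init__()
        self.divisions = Counter()
        self.renditions = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in ("div0", "div1", "div2"):
            self.divisions[tag] += 1
        elif tag in ("hi", "p"):
            # rend is None when the tag carries no rend attribute.
            rend = dict(attrs).get("rend")
            self.renditions[(tag, rend)] += 1


def count_markup(text):
    """Return (division counts, rendition counts) for one file's text."""
    parser = MarkupCounter()
    parser.feed(text)
    parser.close()
    return parser.divisions, parser.renditions


if __name__ == "__main__":
    divisions, renditions = count_markup(SAMPLE)
    print(dict(divisions))
    print(dict(renditions))
```

Since one file holds one issue, the count of <div2> tags in a file gives the number of articles in that issue.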

The text has been divided into paragraphs according to the original HTML file; the sentences have been tagged automatically. The titles and authors have been annotated using the tags <bibl><author><s>; an article can lack a title or an author.
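
The paragraph and sentence tags make it straightforward to pull out sentences per paragraph, and the author where one is given. The sketch below is a minimal regex-based illustration. The exact nesting of <bibl>, <author> and <s> in SAMPLE is an assumption, the sample text is invented, and the function names are hypothetical.

```python
import re

# Invented article; the nesting of <bibl>, <author> and <s> is assumed.
SAMPLE = (
    "<div2>"
    "<bibl><author><s>Mari Maasikas</s></author></bibl>"
    "<p><s>Esimene lause.</s> <s>Teine lause.</s></p>"
    "<p><s>Kolmas lause.</s></p>"
    "</div2>"
)

AUTHOR_RE = re.compile(r"<author>(.*?)</author>", re.DOTALL)
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
PARAGRAPH_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def article_author(article):
    """Return the author string, or None if the article has no author."""
    match = AUTHOR_RE.search(article)
    if match is None:
        return None
    return TAG_RE.sub("", match.group(1)).strip()


def article_sentences(article):
    """Return a list of paragraphs, each a list of sentence strings."""
    paragraphs = []
    for body in PARAGRAPH_RE.findall(article):
        sentences = [TAG_RE.sub("", s).strip() for s in SENTENCE_RE.findall(body)]
        paragraphs.append(sentences)
    return paragraphs
```

Returning None for a missing author reflects the note above that an article can lack a title or an author.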

Every file begins with a <teiHeader> documenting the file contents, size, the tags used, etc.

SGML-entities

The SGML files contain the entities listed in this table.

Last modified: December 21 2018 17:20:19.