This subcorpus contains texts from the Internet forums at http://forum.planet.ee/, altogether 5 million words in 165,000 postings (SGML version) or 8 million words in 319,000 postings (XML TEI-P5 version). The texts date from the years 2000 to 2008.
One file contains discussions from one web directory (SGML version) or one forum (XML TEI-P5).
The corpus is free to use for non-commercial purposes only.
The basic idea is that the action in a forum can be described as a transcript of a play: the actors enter the stage, deliver their lines, and leave the stage. The speaker/writer has been tagged as <speaker>, the text of one speaker as <sp>, and the topic of the message as <head>.
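As an illustration of the play-like structure, the following is a minimal Python sketch that reads the speaker and topic of each posting. The sample fragment is hypothetical: the wrapping <div>, the <p> and <s> elements, and the placement of <head> inside <sp> are assumptions made for the example, and real corpus files may be nested or namespaced differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element nesting here is an assumption,
# not a copy of an actual corpus file.
SAMPLE = """<div>
  <sp>
    <speaker>kasutaja1</speaker>
    <head>Teema</head>
    <p><s>Tere!</s></p>
  </sp>
  <sp>
    <speaker>kasutaja2</speaker>
    <head>Re: Teema</head>
    <p><s>Tere ise!</s></p>
  </sp>
</div>"""


def iter_postings(xml_text):
    """Yield (speaker, topic) pairs, one per <sp> element."""
    root = ET.fromstring(xml_text)
    for sp in root.iter("sp"):
        speaker = sp.findtext("speaker", default="")
        head = sp.findtext("head", default="")
        yield speaker.strip(), head.strip()


postings = list(iter_postings(SAMPLE))
for speaker, head in postings:
    print(f"{speaker}: {head}")
```

For the SGML version, which may not be well-formed XML, a tolerant parser or a regular-expression approach would probably be needed instead.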
Every file starts with a <teiheader> documenting the file's contents, size, the tags used, etc.
The sentences in the texts have been annotated automatically. In the SGML version the annotation follows the norms of the written language: for example, if a posting contains more than one sentence but the sentences do not begin with a capital letter, they have not been annotated as separate sentences. In the TEI-P5 version the annotation follows the forum conventions instead.
Longer non-Estonian passages (where automatic identification succeeded) have been removed and replaced by the tag <gap desc='võõrkeelne tekst'> ("foreign-language text"). Pictures and emoticons have been replaced with the tag <gap desc='image'>, and hyperlinks with the tag <gap desc='hüperlink'>.
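To get a feel for how much material has been omitted, one could count the <gap> tags by their desc value. The sketch below is only an illustration: it uses a regular expression rather than an XML parser, on the assumption that the SGML files may not be well-formed XML, and the sample string is made up.

```python
import re
from collections import Counter

# Matches <gap desc='...'> with single or double quotes,
# with or without a trailing slash.
GAP_RE = re.compile(r"""<gap\s+desc=['"]([^'"]*)['"]\s*/?>""")

# Made-up sample text using the three desc values described above.
SAMPLE = (
    "Vaata <gap desc='hüperlink'> ja <gap desc='image'> "
    "<gap desc='image'> <gap desc='võõrkeelne tekst'>"
)

# Count how often each kind of omitted material occurs.
counts = Counter(GAP_RE.findall(SAMPLE))
for desc, n in counts.most_common():
    print(desc, n)
```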
If a previous posting has been cited, the citation has been tagged as <hi rend='quote'> (in the downloadable TEI version of the corpus). Not all citations have been recognized and tagged automatically, so the corpus contains many repeated sentences or passages.
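Users who want to reduce the repetition caused by cited postings could drop the tagged citations before counting words or sentences. The following is a minimal sketch under stated assumptions: the sample fragment and its nesting are hypothetical, the rend value is written with straight quotes, and untagged citations will of course remain in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical posting that quotes an earlier one.
SAMPLE = (
    "<sp><speaker>kasutaja2</speaker>"
    '<p><hi rend="quote">Tere!</hi> Tere ise!</p></sp>'
)


def drop_quotes(root):
    """Remove <hi rend="quote"> elements, keeping the text that follows them."""
    # Snapshot the elements first so the tree can be modified safely.
    for parent in list(root.iter()):
        prev = None
        for child in list(parent):
            if child.tag == "hi" and child.get("rend") == "quote":
                # The tail belongs to the surrounding text, not to the quote.
                tail = child.tail or ""
                if prev is None:
                    parent.text = (parent.text or "") + tail
                else:
                    prev.tail = (prev.tail or "") + tail
                parent.remove(child)
            else:
                prev = child
    return root


root = drop_quotes(ET.fromstring(SAMPLE))
remaining = "".join(root.find("p").itertext()).strip()
print(remaining)
```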
In the version of the corpus that is accessible through our corpus query, all mark-up has been deleted except the <gap> tags that mark omitted material.
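A comparable plain-text view can be approximated from the downloadable files by removing every tag except <gap>. This is only a rough sketch, not the procedure actually used for the query version; the sample line and its <p> and <s> elements are assumptions.

```python
import re

# Any tag whose name does not start with "gap" (opening or closing).
NON_GAP_TAG = re.compile(r"<(?!gap\b)[^>]*>")


def strip_markup(text):
    """Delete all tags except <gap ...>, leaving the text content in place."""
    return NON_GAP_TAG.sub("", text)


# Made-up sample line.
SAMPLE = "<p><s>Vaata <gap desc='image'> siia.</s></p>"
plain = strip_markup(SAMPLE)
print(plain)
```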