
The Mixed Corpus: Forums

Contents

This subcorpus contains texts from the Internet forums at http://forum.planet.ee/, altogether 5 million words in 165,000 postings (SGML version) or 8 million words in 319,000 postings (XML TEI-P5 version). The texts date from 2000 to 2008.

One file contains discussions from one web directory (SGML version) or one forum (XML TEI-P5).

How can one use it?

The corpus may be used free of charge, for non-commercial purposes only.

Mark-up and annotation

The basic idea is that the activity in a forum can be described like the transcript of a play: the actors enter the stage, deliver their lines, and leave the stage. The speaker/writer is tagged as <speaker>, the text of one speaker as <sp>, and the theme of the message as <head>.
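The play-like structure described above can be read with a standard XML parser. The following is a minimal sketch, not the corpus's own tooling: the sample fragment, the wrapping <div>, the <p> and <s> elements, the user names and the function name read_postings are all assumptions made for illustration. Real TEI-P5 files may also declare an XML namespace, in which case the tag names would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <sp>, <speaker> and <head> come from the
# corpus description; <div>, <p> and <s> are assumed for illustration.
SAMPLE = """
<div>
  <sp>
    <speaker>kasutaja1</speaker>
    <head>Tere</head>
    <p><s>Tere tulemast foorumisse.</s></p>
  </sp>
  <sp>
    <speaker>kasutaja2</speaker>
    <head>Re: Tere</head>
    <p><s>Aitäh!</s></p>
  </sp>
</div>
"""


def read_postings(xml_text):
    """Return one dict per <sp> element: speaker, head and body text."""
    root = ET.fromstring(xml_text.strip())
    postings = []
    for sp in root.iter("sp"):
        body_parts = []
        for child in sp:
            # <speaker> and <head> are metadata, not part of the body
            if child.tag in ("speaker", "head"):
                continue
            body_parts.append("".join(child.itertext()))
        postings.append({
            "speaker": sp.findtext("speaker", default=""),
            "head": sp.findtext("head", default=""),
            # Collapse runs of whitespace left over from the mark-up layout
            "text": " ".join(" ".join(body_parts).split()),
        })
    return postings


postings = read_postings(SAMPLE)
for posting in postings:
    print(posting["speaker"], "|", posting["head"], "|", posting["text"])
```

Each posting becomes a small record, so per-speaker or per-theme statistics can be computed without touching the mark-up again.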

Every file starts with a <teiheader> documenting the file contents, size, the tags used, etc.

Sentences have been annotated automatically. In the SGML version the annotation follows the norms of the written language: for example, if a posting contains more than one sentence but the sentences do not begin with a capital letter, they have not been annotated as separate sentences. In the TEI-P5 version the annotation follows forum conventions instead.
Longer non-Estonian passages (where automatic identification succeeded) have been removed and replaced by the tag <gap desc='v&otilde;&otilde;rkeelne tekst'> (Estonian for "foreign-language text"). Pictures and emoticons have been replaced with the tag <gap desc='image'>, and hyperlinks with the tag <gap desc='h&uuml;perlink'> (Estonian for "hyperlink").
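To see how much material was omitted from a file, the <gap> tags can be counted by their desc value. This is a minimal sketch under stated assumptions: the sample fragment, the <p> element and the function name count_gaps are hypothetical, the <gap> elements are written as empty XML elements, and the desc values use literal characters. Files that spell the values with entity references such as &otilde; would need those entities declared or resolved before a strict XML parser accepts them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical posting with one omitted hyperlink and two omitted images.
SAMPLE = (
    "<sp><speaker>kasutaja1</speaker>"
    "<p>Vaata siia <gap desc='hüperlink'/> "
    "<gap desc='image'/> <gap desc='image'/></p></sp>"
)


def count_gaps(xml_text):
    """Count <gap> elements grouped by the value of their desc attribute."""
    root = ET.fromstring(xml_text)
    return Counter(gap.get("desc", "") for gap in root.iter("gap"))


counts = count_gaps(SAMPLE)
for desc, n in counts.most_common():
    print(desc, n)
```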

If a previous posting has been quoted, the quotation is tagged as <hi rend='quote'> (in the downloadable TEI version of the corpus). Not all quotations have been recognized and tagged automatically, so the corpus contains many repeated sentences or passages.
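Because quoted passages repeat text from earlier postings, word counts are less inflated if tagged quotations are skipped. The sketch below shows one way to do that with Python's standard library; the sample fragment, the <p> element and the function name text_without_quotes are assumptions for illustration. It removes only quotations that carry the <hi rend='quote'> tag, so untagged repetitions remain.

```python
import xml.etree.ElementTree as ET


def text_without_quotes(elem):
    """Return the text under elem, skipping <hi rend='quote'> subtrees."""
    parts = [elem.text or ""]
    for child in elem:
        is_quote = child.tag == "hi" and child.get("rend") == "quote"
        if not is_quote:
            parts.append(text_without_quotes(child))
        # The tail follows the child's closing tag, so it is the
        # writer's own text and is kept even when the child is a quote.
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical posting that quotes an earlier question and answers it.
SAMPLE = (
    "<sp><speaker>kasutaja2</speaker>"
    "<p><hi rend='quote'>Kas keegi teab?</hi> Jah, mina tean.</p></sp>"
)

sp = ET.fromstring(SAMPLE)
reply = " ".join(text_without_quotes(sp.find("p")).split())
print(reply)
```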

In the version of the corpus that is accessible via our corpus query, all mark-up has been deleted except the <gap> tags that mark omitted material.


Last modified: October 07 2011 20:05:54.