This subcorpus contains 87 000 postings from various newsgroups, 8 million words altogether. The texts date from the years 2000-2004.
The texts were automatically saved from the Internet in 2004 and converted to SGML format. The conversion programs were written by Kaarel Veskis.
Each file contains the postings from a single newsgroup.
The corpus is free for use for non-commercial purposes only.
The basic idea was that the transcript of a newsgroup is similar to the transcript of a play: the actors enter the stage, deliver their lines, and leave the stage. The time of posting has been tagged as <time>, the speaker/writer as <speaker>, and the topic of the message as <head>.
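As an illustration of how these three tags might be read back out of a file, here is a minimal Python sketch. The sample fragment, its time format, the <p> element and the function name extract_fields are all assumptions made for the example; the actual files may order or nest the tags differently.

```python
import re

# Hypothetical fragment for illustration only; real corpus files may
# nest, order, or format these elements differently.
SAMPLE = """
<head>Ilm</head>
<speaker>mari</speaker>
<time>2003-05-14 10:32</time>
<p>Tere hommikust!</p>
"""

def extract_fields(sgml: str) -> dict:
    """Return the first <time>, <speaker> and <head> contents found in sgml."""
    fields = {}
    for tag in ("time", "speaker", "head"):
        # Match case-insensitively, since SGML tag names are
        # typically not case-sensitive.
        match = re.search(
            rf"<{tag}\b[^>]*>(.*?)</{tag}>", sgml, re.IGNORECASE | re.DOTALL
        )
        fields[tag] = match.group(1).strip() if match else None
    return fields

print(extract_fields(SAMPLE))
```

A regular expression is used here only because it keeps the sketch short; for full files a proper SGML-aware parser would be more robust.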
Every file starts with a <teiheader> that documents the file's contents, its size, the tags used, etc.
Sentences in the texts have been annotated automatically, following the norms of the written language. For example, if a posting contains more than one sentence but the sentences do not begin with a capital letter, they have not been annotated as separate sentences.
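The effect of this rule can be illustrated with a small Python sketch. It is not the annotation program actually used for the corpus; the function name split_sentences and the exact rule (split after ., ! or ? only when the next word starts with a capital letter) are assumptions made for the example.

```python
import re

# Split after sentence-final punctuation only when the following word
# begins with an uppercase letter (Estonian letters included).
# This is an assumed simplification, not the corpus's real segmenter.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-ZÕÄÖÜŠŽ])")

def split_sentences(text: str) -> list:
    return BOUNDARY_RE.split(text.strip())

# The lowercase "kuidas" does not start a new sentence under this rule,
# so the first two sentences stay together as one unit.
print(split_sentences("tere. kuidas läheb? Hästi läheb."))
```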
Longer non-Estonian passages (where automatic identification succeeded) have been removed and replaced by the tag <gap desc='võõrkeelne tekst'> (Estonian for 'foreign-language text'). Pictures and emoticons have been replaced by the tag <gap desc='image'>, and hyperlinks by the tag <gap desc='hüperlink'>.
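To get a sense of how much material has been omitted, the <gap> tags can be counted by their desc value. The following Python sketch assumes the tags are written exactly as shown above, with single-quoted attribute values; the function name count_gaps and the sample string are hypothetical.

```python
import re
from collections import Counter

# Assumes the single-quoted desc='...' form shown in this description.
GAP_RE = re.compile(r"<gap\s+desc='([^']*)'\s*/?>", re.IGNORECASE)

def count_gaps(text: str) -> Counter:
    """Count <gap> tags grouped by the value of their desc attribute."""
    return Counter(GAP_RE.findall(text))

# Hypothetical posting text, for illustration only.
sample = "Vaata siia <gap desc='hüperlink'> <gap desc='image'> ja <gap desc='image'>"
print(count_gaps(sample))
```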
If a posting quotes a previous posting, the quoted text has been tagged as <hi rend='quote'> (in the downloadable TEI version of the corpus). Not all quotations have been recognized and tagged automatically, so the corpus contains many repeated sentences or passages.
In the version of the corpus that is accessible through our corpus query, all mark-up has been deleted except the <gap> tags that mark omitted material.
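A comparable reduction can be approximated on the downloadable files. The Python sketch below removes every tag except <gap>; it is an assumed reconstruction for illustration, not the procedure actually used to build the query version, and the function name strip_markup_keep_gaps is hypothetical.

```python
import re

# Match any tag whose name is not "gap". Assumes attribute values
# never contain a literal ">" character.
TAG_RE = re.compile(r"<(?!gap\b)[^>]+>", re.IGNORECASE)

def strip_markup_keep_gaps(text: str) -> str:
    """Delete all mark-up except <gap ...> tags."""
    return TAG_RE.sub("", text)

# Hypothetical fragment, for illustration only.
fragment = "<speaker>mari</speaker> <p>Tere <gap desc='image'></p>"
print(strip_markup_keep_gaps(fragment))
```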