This corpus contains 300 transcripts of internet chat rooms from 2003 and 2006. They make up 7 million words in 2.8 million chat lines.
These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program "Estonian language and national culture".
The corpus is free for use for non-commercial purposes only.
The texts originate from 22 different chat rooms.
The texts have been downloaded from the internet and converted to SGML (TEI). The conversion programs were written by Kaarel Veskis.
One file contains one uninterrupted transcript of a chat room.
Accented letters and some other symbols in the original sources have been converted to SGML entities. The original form of the chat lines has been kept, including the use of numbers in place of accented letters.
The basic idea behind tagging was that the transcript of a chat room is similar to a transcript of a play: the actors enter the stage, produce their lines, and leave the stage. The time of all the actions has been tagged as <time>, the speaker as <speaker>, and the actions between the chat lines as <stage>.
E-mail and internet addresses have been altered in order to protect the privacy of the participants.
Every file starts with a <teiHeader> documenting the file contents, size, the tags used, etc.
In the version of the chatroom corpus that is accessible via our corpus interface, only the following mark-up occurs:
<speaker>Speaker</speaker><p>Text written by the speaker</p>
<stage>Speaker’s description of his/her activities, e.g. billy läheb ploomimahla tooma</stage>
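The two mark-up patterns above can be read with a few lines of Python. This is only a sketch: it assumes that the tags are not nested and that a <speaker> element is always followed directly by its <p> element, as in the pattern shown. The function name parse_transcript and the chat line "tere" in the sample are hypothetical.

```python
import re

# One alternative per mark-up pattern: a speaker with a chat line, or a stage remark.
EVENT = re.compile(
    r"<speaker>(.*?)</speaker>\s*<p>(.*?)</p>|<stage>(.*?)</stage>",
    re.DOTALL,
)


def parse_transcript(text):
    """Return (kind, speaker, content) tuples in document order."""
    events = []
    for match in EVENT.finditer(text):
        if match.group(3) is not None:
            # <stage> remarks carry no separate speaker element.
            events.append(("stage", None, match.group(3)))
        else:
            events.append(("utterance", match.group(1), match.group(2)))
    return events


# Hypothetical sample built from the patterns above.
sample = (
    "<speaker>billy</speaker><p>tere</p>\n"
    "<stage>billy läheb ploomimahla tooma</stage>\n"
)
for event in parse_transcript(sample):
    print(event)
```

A regular expression is enough here because the mark-up is flat; for the full SGML files with a <teiHeader>, a proper SGML parser would be the more robust choice.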
The downloadable version of the chatroom corpus also contains the automatically generated comments, such as those about persons entering the chatroom or changing their nickname, e.g.
<stage>* naga is now known as naga_eemal</stage>
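Automatically generated comments of this kind can be told apart from the participants' own descriptions by their fixed wording. The sketch below, in Python, picks out nickname changes only; it assumes that every such comment follows exactly the wording of the example above. The wording of other automatic messages (joining, leaving) is not shown here, so it is not covered. The names NICK_CHANGE and nick_changes are hypothetical.

```python
import re

# Assumption: nickname changes always read "* OLD is now known as NEW".
NICK_CHANGE = re.compile(
    r"<stage>\* ([^\s<]+) is now known as ([^\s<]+)</stage>"
)


def nick_changes(text):
    """Return a list of (old_nick, new_nick) pairs found in the text."""
    return NICK_CHANGE.findall(text)


line = "<stage>* naga is now known as naga_eemal</stage>"
print(nick_changes(line))
```

Such pairs could be used, for example, to treat naga and naga_eemal as the same participant when counting chat lines per speaker.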
The corpus contains 7 million tokens in chat lines, plus a further 7 million tokens describing the entering and leaving of the participants.
In addition to ASCII symbols (unaccented letters, numbers and punctuation signs), the texts contain SGML entities.