The Mixed Corpus: Chat rooms

Contents

This corpus contains 300 transcripts of internet chat rooms from 2003 and 2006. They make up 7 million words in 2.8 million chat lines.

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program "Estonian language and national culture".

How can one use it?

The corpus may be used free of charge, for non-commercial purposes only.

Sources and tagging

The texts originate from 22 different chat rooms.

The texts have been downloaded from the internet and converted to SGML (TEI). The conversion programs were written by Kaarel Veskis.

One file contains one uninterrupted transcript of a chat room.

Accented letters and some other symbols in the original sources have been converted to SGML entities. Otherwise the original form of the chat lines has been kept, including the use of digits in place of accented letters.

The basic idea behind tagging was that the transcript of a chat room is similar to a transcript of a play: the actors enter the stage, produce their lines, and leave the stage. The time of all the actions has been tagged as <time>, the speaker as <speaker>, and the actions between the chat lines as <stage>.

E-mail and internet addresses have been altered to protect the privacy of the participants.

Every file starts with a <teiHeader> documenting the file's contents, its size, the tags used, etc.

In the version of the chat room corpus that is accessible via our corpus interface, only the following mark-up occurs:

<speaker>Speaker</speaker><p>Text written by the speaker</p>
<stage>Speaker's description of his/her activities, e.g. billy läheb ploomimahla tooma ("billy goes to fetch plum juice")</stage>
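
As an illustration only, the two line shapes above could be read with a few lines of Python. This is a minimal sketch, not the tooling used to build the corpus: it assumes one element per line, and the names parse_line, SPEAKER_LINE and STAGE_LINE as well as the sample chat line "tere" are made up for the example.

```python
import re

# Patterns for the two line shapes quoted above. The one-element-per-line
# layout is an assumption, not something the documentation guarantees.
SPEAKER_LINE = re.compile(r"<speaker>(.*?)</speaker><p>(.*?)</p>")
STAGE_LINE = re.compile(r"<stage>(.*?)</stage>")


def parse_line(line):
    """Return ("say", speaker, text), ("stage", None, text), or None."""
    line = line.strip()
    match = SPEAKER_LINE.fullmatch(line)
    if match:
        return ("say", match.group(1), match.group(2))
    match = STAGE_LINE.fullmatch(line)
    if match:
        return ("stage", None, match.group(1))
    return None  # anything that fits neither shape is left unparsed


# Hypothetical sample lines in the interface mark-up.
sample = [
    "<speaker>billy</speaker><p>tere</p>",
    "<stage>billy läheb ploomimahla tooma</stage>",
]
parsed = [parse_line(line) for line in sample]
```

Returning None for unrecognised lines keeps the sketch from guessing at mark-up that this page does not describe.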

The downloadable version of the chat room corpus also contains automatically generated comments, for instance about persons entering the chat room, e.g.
<stage>* naga is now known as naga_eemal</stage>

These comments have been deleted in the version of the corpus accessible via the corpus interface.
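
A user of the downloadable version who wants something close to the interface version might filter such comments out. The sketch below is only a heuristic: it assumes that automatically generated comments are exactly those <stage> elements whose text begins with "* ", as in the single example above, which this page does not confirm. The function names are hypothetical.

```python
import re

STAGE = re.compile(r"<stage>(.*?)</stage>")


def is_automatic_comment(line):
    # Assumption: an automatic comment is a <stage> element whose text
    # starts with "* ", as in the "is now known as" example.
    match = STAGE.fullmatch(line.strip())
    return bool(match) and match.group(1).startswith("* ")


def drop_automatic_comments(lines):
    """Keep every line except those recognised as automatic comments."""
    return [line for line in lines if not is_automatic_comment(line)]


lines = [
    "<stage>* naga is now known as naga_eemal</stage>",
    "<stage>billy läheb ploomimahla tooma</stage>",
]
kept = drop_automatic_comments(lines)
```

Checking the heuristic against a sample of real files before relying on it would be prudent, since speakers' own descriptions may also start with an asterisk.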

Size

The corpus contains 7 million tokens in chat lines, plus 7 million tokens describing the entering and leaving of the participants.

Symbols and entities

In addition to ASCII symbols (unaccented letters, numbers and punctuation marks), the texts contain SGML entities.
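
For reading the texts as plain Unicode, the entities have to be decoded. The following is a minimal sketch, assuming the entities use the ISO Latin names that HTML shares (such as &auml; and &otilde;); this page does not list the entity set, so any entity outside that set would need a separate mapping. The sample string is invented.

```python
from html import unescape


def decode_entities(text):
    # html.unescape knows the HTML named entities. That the corpus
    # entities carry the same names is an assumption.
    return unescape(text)


# Invented sample with three entities for Estonian accented letters.
decoded = decode_entities("l&auml;heb t&otilde;esti &uuml;les")
```

Digits typed in place of accented letters are not entities and are left untouched by this step.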


Last modified: April 07 2011 18:46:05.