This subcorpus contains comments from the news portal Delfi posted from 26.01.2004 to 18.03.2004, in total 2 161 098 words in 143 210 sentences.
The corpus may be used free of charge, for non-commercial purposes only.
Every file starts with a <teiheader> documenting the file contents, size, tags used, etc.
Both the headline of the commented article and the author of each individual comment are tagged as <head>, e.g.
<div0 type='kommentaarid'> <head> Delfi kommentaarid 2004 . Elukohatu saab karistada . </head> <div1 type='kommentaar'> <head> muki </head>
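Because the same <head> tag is used for two different things, the enclosing element has to be checked to tell them apart. The following is a minimal Python sketch of one way to do that, assuming the layout of the example above (a <head> immediately after <div0> is a headline, a <head> immediately after <div1> is a comment author). The helper name split_heads is hypothetical, and the regular expression is only an illustration, not a tool shipped with the corpus.

```python
import re

# Sample copied from the example above; real files may differ in
# whitespace and may carry further attributes on the div tags.
SAMPLE = (
    "<div0 type='kommentaarid'> <head> Delfi kommentaarid 2004 . "
    "Elukohatu saab karistada . </head> "
    "<div1 type='kommentaar'> <head> muki </head>"
)

# Assumption: a <head> directly following <div0 ...> is the article
# headline, and a <head> directly following <div1 ...> is a comment author.
HEAD_RE = re.compile(r"<(div[01])\b[^>]*>\s*<head>(.*?)</head>", re.DOTALL)


def split_heads(text):
    """Return (headlines, authors) found in text (hypothetical helper)."""
    headlines, authors = [], []
    for div, content in HEAD_RE.findall(text):
        target = headlines if div == "div0" else authors
        # Collapse runs of whitespace left over from the tokenised markup.
        target.append(" ".join(content.split()))
    return headlines, authors


headlines, authors = split_heads(SAMPLE)
print(headlines)
print(authors)
```

A regular expression is used here rather than an XML parser because the sample does not show whether the files are well-formed XML.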
The time of writing is tagged as <time>.
Sentences in the texts have been annotated automatically, following the norms of the written language. For example, if a posting contains more than one sentence but the sentences do not begin with a capital letter, they have not been annotated as separate sentences.
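The practical consequence of this rule can be illustrated with a small Python sketch. It is not the segmenter actually used for the corpus, whose details are not described here; it only assumes that a boundary is placed after sentence-final punctuation when the next word starts with a capital letter. The function name and the sample strings are invented for illustration.

```python
import re

# Assumption: a sentence boundary is recognised only after ".", "!" or "?"
# followed by whitespace and an uppercase letter (Estonian letters included).
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-ZÕÄÖÜŠŽ])")


def split_sentences(posting):
    """Split a posting into sentences under the capital-letter assumption."""
    return [part for part in BOUNDARY_RE.split(posting.strip()) if part]


# Invented examples: the lowercase continuation stays in the first sentence.
mixed = split_sentences("Tere tulemast . see on test . Uus lause .")
lower = split_sentences("esimene lause . teine lause .")
print(mixed)
print(lower)
```

Under this assumption a posting written entirely in lowercase comes out as a single sentence, so sentence counts and sentence lengths for such postings should be read with that in mind.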
Longer non-Estonian passages, where they were successfully identified automatically, have been removed and replaced with the tag <gap desc='võõrkeelne tekst'> ('foreign-language text').
In the version of the corpus accessible via our corpus query, all mark-up has been deleted except the <gap> tags that mark omitted material.
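A comparable reduction can be approximated on the downloadable files. The Python sketch below removes every tag except <gap> and normalises whitespace. It is an assumption about how such a version could be produced, not the procedure actually used for the query version, and the <p> and <s> tags in the sample are invented for illustration.

```python
import re

# Matches any tag except <gap ...> (and a possible closing </gap>).
NON_GAP_TAG = re.compile(r"<(?!/?gap\b)[^>]*>")


def strip_markup_keep_gap(text):
    """Delete all tags except <gap>, then collapse whitespace."""
    without_tags = NON_GAP_TAG.sub(" ", text)
    return " ".join(without_tags.split())


# Invented sample; only the <gap> tag is taken from the description above.
sample = (
    "<p> <s> Tere ! </s> <gap desc='võõrkeelne tekst'> "
    "<s> Head aega . </s> </p>"
)
cleaned = strip_markup_keep_gap(sample)
print(cleaned)
```

Replacing each removed tag with a space before collapsing whitespace avoids gluing together words that were separated only by a tag.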