
The Mixed Corpus: Arvutitehnika ja Andmetöötlus (Computer technics and data processing)

This corpus contains texts from the internet archive of the journal „Arvutitehnika ja andmetöötlus“ („Computer technics and data processing“), available at http://deepthought.ttu.ee/aa/. It covers the journal volumes from 1999 to 2005 and contains approximately 625 000 words.

The collection and annotation of these texts were supported by the national programme „The Language Technology Support for Estonian“.

The corpus is free to use for non-commercial purposes only.

Source and annotation

The texts have been semi-automatically downloaded from the internet and converted from PDF to SGML (TEI) format. The conversion programs were written and the conversions carried out by Kaarel Veskis and Heiki-Jaan Kaalep.
Each file contains one issue of the journal. Non-textual material (illustrations, figures), tables, and lists of references have been omitted.

No spell-checking or error correction has been performed.

Annotation

Annotation follows the TEI guidelines. <div0> marks one issue of the journal and <div1> marks one article. The text has been divided into paragraphs following the original HTML markup. Sentences have been marked automatically, so the mark-up may contain some errors. Headings and authors have been annotated with <head> and <bibl><author> tags, respectively.
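
The element structure described above can be walked with a few lines of Python. The following is only a minimal sketch, not the tooling used to build the corpus: it uses the standard library's html.parser as a lenient tokenizer rather than a real SGML parser, and it assumes that paragraphs and sentences are tagged <p> and <s>, which this page does not state. The sample text, the class name IssueParser and the function name parse_issue are invented for illustration.

```python
from html.parser import HTMLParser


class IssueParser(HTMLParser):
    """Collect the first heading, the authors and a sentence count per <div1>."""

    def __init__(self):
        super().__init__()
        self.articles = []    # one dict per <div1> article
        self._current = None  # the article being read, or None outside <div1>
        self._field = None    # "head" or "author" while inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "div1":
            self._current = {"head": "", "authors": [], "sentences": 0}
            self.articles.append(self._current)
        elif self._current is None:
            return  # ignore markup outside an article
        elif tag == "head" and not self._current["head"]:
            self._field = "head"  # keep only the first heading of the article
        elif tag == "author":
            self._field = "author"
            self._current["authors"].append("")
        elif tag == "s":  # ASSUMPTION: sentences are marked with <s>
            self._current["sentences"] += 1

    def handle_endtag(self, tag):
        if tag == "div1":
            self._current = None
        if tag == self._field:
            self._field = None

    def handle_data(self, data):
        if self._current is None:
            return
        if self._field == "head":
            self._current["head"] += data
        elif self._field == "author":
            self._current["authors"][-1] += data


def parse_issue(sgml_text):
    """Return a list of article summaries for one issue file."""
    parser = IssueParser()
    parser.feed(sgml_text)
    parser.close()
    return parser.articles


# Invented sample that mimics the described structure (not real corpus text).
SAMPLE = """<div0>
<div1><head>First article</head><bibl><author>A. Author</author></bibl>
<p><s>One sentence.</s> <s>Another sentence.</s></p></div1>
<div1><head>Second article</head>
<p><s>Only one sentence here.</s></p></div1>
</div0>"""

for article in parse_issue(SAMPLE):
    print(article["head"], article["authors"], article["sentences"])
```

Because the sentence boundaries were assigned automatically, any sentence counts obtained this way inherit the segmentation errors mentioned above.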

Every file begins with a <teiHeader> that describes the content and size of the file and lists the tags used.

The following entities have been used in this corpus:

&Auml; - Ä
&Eacute; - É (latin capital E with acute)
&Otilde; - Õ
&Ouml; - Ö
&Scaron; - Š
&Uuml; - Ü
&Zcaron; - Ž
&aacute; - á (latin small a with acute)
&agrave; - à (latin small a with grave)
&amp; - & (ampersand)
&aring; - å (latin small a with ring)
&auml; - ä
&ccaron; - č (latin small c with caron)
&eacute; - é (latin small e with acute)
&egrave; - è (latin small e with grave)
&oacute; - ó (latin small o with acute)
&otilde; - õ
&ouml; - ö
&scaron; - š
&sect; - § (section sign)
&uuml; - ü
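
To read the corpus text as plain Unicode, the entities listed above have to be replaced by the characters they stand for. The sketch below shows one way to do this in Python. It assumes that the entities carry the standard ISO/HTML names (Auml, otilde, scaron and so on); the names CORPUS_ENTITIES and decode_entities are invented for this example. Entities that are not in the table are left unchanged, so unexpected ones remain visible in the output.

```python
import re

# Entity name -> character, following the list above.
# ASSUMPTION: the corpus uses the standard ISO/HTML entity names.
CORPUS_ENTITIES = {
    "Auml": "Ä", "Eacute": "É", "Otilde": "Õ", "Ouml": "Ö",
    "Scaron": "Š", "Uuml": "Ü", "Zcaron": "Ž",
    "aacute": "á", "agrave": "à", "amp": "&", "aring": "å",
    "auml": "ä", "ccaron": "č", "eacute": "é", "egrave": "è",
    "oacute": "ó", "otilde": "õ", "ouml": "ö", "scaron": "š",
    "sect": "§", "uuml": "ü",
}

_ENTITY_RE = re.compile(r"&([A-Za-z]+);")


def decode_entities(text):
    """Replace known entity references in a single pass; keep unknown ones."""
    def replace(match):
        return CORPUS_ENTITIES.get(match.group(1), match.group(0))
    return _ENTITY_RE.sub(replace, text)


print(decode_entities("andmet&ouml;&ouml;tlus"))
print(decode_entities("&Scaron;ahh &amp; &unknown;"))
```

The replacement runs in a single pass, so text such as &amp;ouml; decodes to the literal string &ouml; rather than to ö.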

Last modified: December 21 2018 19:19:59.