This corpus contains:
The collection and processing of the texts were financed by the national program "Estonian language and national culture".
The texts originate from the Estonian Legal Language Centre (www.legaltext.ee) as of April 30, 2002. The aligned versions are based on the TEI P3 compatible versions of the same files, obtained from www.cl.ut.ee in October 2004. The Estonian side of the corpus is a sub-part of the legislative texts sub-corpus of the Reference Corpus of Estonian. The file names reflect the source file names.
The texts have been sentence-aligned, with list items treated as equal to sentences. The Estonian and English sentences may be in 1-1, 1-2 or 2-1 alignments; there are no other alignment types (such as 1-0, 0-1 or 2-2) in this corpus. Such units were either not found or were left aside, as they would be hard to use in future work aimed at finding parallel multi-word units.
The tags <eesti> and </eesti> delimit the Estonian part; <inglise> and </inglise> delimit the English part. Each translation unit is on a separate line, with source and translation alternating, the original (source) coming first.
Subscripts and superscripts are tagged with <hi rend="sub"> and <hi rend="sup">. It often happens that the original or the translated unit contains one of these while the corresponding parallel unit does not.
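The layout described above can be read with a few lines of Python. This is a sketch, not the project's tooling: the exact wrapping of units in <eesti>/<inglise> tags and the two-line sample are assumptions made for illustration.

```python
import re

# One plausible reading of the layout: each translation unit sits on its
# own line, wrapped in <eesti>...</eesti> or <inglise>...</inglise>,
# with source and translation alternating.
UNIT = re.compile(r"<(eesti|inglise)>(.*?)</\1>", re.S)

def parse_pairs(text):
    """Return (estonian, english) pairs from alternating tagged lines."""
    units = UNIT.findall(text)
    pairs = []
    for (lang_a, unit_a), (lang_b, unit_b) in zip(units[::2], units[1::2]):
        if lang_a == "eesti":
            pairs.append((unit_a.strip(), unit_b.strip()))
        else:
            pairs.append((unit_b.strip(), unit_a.strip()))
    return pairs

# Invented two-unit sample for illustration.
sample = """<eesti>Paragrahv 1. Seaduse eesmark.</eesti>
<inglise>Section 1. Purpose of the Act.</inglise>
"""
print(parse_pairs(sample))
```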
153,500 parallel units (sentences or list items) in 392 files. 1.7 million tokens in Estonian, 2.9 million tokens in English.
English-Estonian parallel texts are divided into two groups, following the original division of the source texts in www.legaltext.ee:
224,323 + 57,836 parallel units (sentences or list items) in 2981 + 1093 files. 2.6 + 0.7 million tokens in Estonian, 3.9 + 1.0 million tokens in English.
Numbers and abbreviations have been counted as tokens.
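The counting convention can be illustrated with a toy tokenizer in which words, numbers and abbreviations each count as one token. The regular expression is an illustrative guess, not the tokenizer actually used for the corpus counts.

```python
import re

# Toy tokenizer: an abbreviation like "e.g." is one token, a number
# like "3.5" is one token, and any other word is one token.
TOKEN = re.compile(r"[A-Za-z]+\.(?:[A-Za-z]+\.)*|\d+(?:[.,]\d+)*|\w+")

def count_tokens(text):
    """Count tokens, with numbers and abbreviations as single tokens."""
    return len(TOKEN.findall(text))
```

For example, `count_tokens("Section 12 applies, e.g. to 3.5 tonnes.")` yields 7: "e.g." and "3.5" are single tokens.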
The aligning was done using the Vanilla aligner, a language-independent aligner based on the algorithm from: Gale, W. A. and Church, K. W. (1993). A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1), 75-102.
The algorithm assumes that, from the start, the original (source) and its translation (target) consist of an equal number of larger parallel units, delimited in some known way; all it has to do is align the smaller units inside these parallel units. For example, a book consists of chapters, chapters consist of paragraphs, and paragraphs consist of sentences. To align paragraphs, we assume that no paragraph crosses a chapter boundary (the number of chapters being the same in the source and the translation, and the chapters being pairwise parallel already); to align sentences, we assume that the number of paragraphs is the same in both texts and that the paragraphs are pairwise parallel.
The algorithm also assumes that the order of sentences in the original text is the same as in the translation.
The algorithm further assumes that the lengths of the original and its translation are correlated: translations of longer sentences are longer than translations of shorter sentences. When aligning units, the length of the original should not differ too much from the length of the translation; thus it is sometimes necessary to prefer a 0-1, 1-2, 2-1 or other complicated alignment over a 1-1 alignment.
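The length-based idea can be sketched as a small dynamic program. This is a toy stand-in, not the Vanilla aligner itself: it uses a crude absolute length-difference cost in place of Gale and Church's probabilistic length model, and it is restricted to the 1-1, 1-2 and 2-1 beads that occur in this corpus (0-1 and 1-0 moves could be added with a fixed penalty).

```python
def align(src, tgt):
    """Toy sentence aligner: dynamic programming over 1-1, 1-2 and 2-1
    beads, scoring each bead by the difference in character length
    (a crude stand-in for the Gale & Church length model)."""
    INF = float("inf")
    n, m = len(src), len(tgt)

    def cost(a, b):
        return abs(sum(len(s) for s in a) - sum(len(s) for s in b))

    # best[i][j] = (cost, last move) for aligning src[:i] with tgt[:j]
    best = [[(INF, None)] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            base = best[i][j][0]
            if base == INF:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                if i + di <= n and j + dj <= m:
                    c = base + cost(src[i:i + di], tgt[j:j + dj])
                    if c < best[i + di][j + dj][0]:
                        best[i + di][j + dj] = (c, (di, dj))

    # Backtrack from the end (assumes a complete path exists).
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = best[i][j][1]
        beads.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return beads[::-1]
```

On a toy input such as `align(["aaaa", "bb"], ["AAAA", "B", "B"])`, the cheapest path pairs the first sentences 1-1 and merges the two short target sentences into a 1-2 bead.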
Our aim was to find parallel sentences, so the paragraphs had to be aligned in the first place.
It turned out that the original and the translation often contain different numbers of paragraphs. There may be several reasons for this. For example, one text may contain an appendix (or several appendices) that is missing from the electronic version of the parallel text, whether the original or the translation. The same may happen with tables and references to other documents. The layout of the texts may also differ: in one text a newline character may mean nothing more than that the text continues on the following line, while in the other it marks the end of a paragraph.
Thus the first task was to find parallel units larger than a sentence or a list item, taking advantage of anchor points in the text. In legislative documents, section, article and list-item numbers can serve as anchor points. Using these, we first aligned the paragraphs with the Vanilla aligner; in the next stage, the sentences were aligned.
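The anchor-point step can be sketched as follows. The regular expression and the section markers it matches are invented examples, not the patterns actually used for the corpus:

```python
import re

# Hypothetical anchor markers; the real corpus used section, article
# and list-item numbers as anchors.
ANCHOR = re.compile(r"^(?:Article\s+\d+|\d+\.)", re.M)

def split_at_anchors(text):
    """Split a text into blocks, each starting at an anchor line."""
    starts = [m.start() for m in ANCHOR.finditer(text)]
    if not starts:
        return [text]
    bounds = starts if starts[0] == 0 else [0] + starts
    return [text[a:b].strip()
            for a, b in zip(bounds, bounds[1:] + [len(text)])]
```

If the source and the translation yield the same number of blocks, each block pair can be sentence-aligned separately; pairs with differing block counts were discarded, as explained below.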
If two parallel texts contained different numbers of sections, articles or numbered list items, they were not included in the parallel corpus. We assumed that in such cases the formal structure of the two texts differed too much and that the simple method used would not yield trustworthy results.