Parallel corpus of Estonian and Swedish

Signe Cousins

University of Tartu

Department of Scandinavian Languages

signel@madli.ut.ee

1. Introduction

The preparatory work for the project of compiling an Estonian-Swedish parallel corpus started in September 1994 and the actual work began in March 1995. It was planned as part of the Estonian-Swedish Lexical Database (ERLEKS), which operates within the framework of the Department of Scandinavian Languages under the supervision of the Swedish guest professor Stig Örjan Ohlsson. I am working on the corpus for my MA degree.

I would like to express my gratitude to Magnus Bergvalls Stiftelse and Humanistisk-Samhällsvetenskapliga Forskningsrådet in Sweden for their financial support that has made this project possible.

2. Description of the corpus

The parallel corpus referred to here is a translation corpus, as opposed to a corpus of texts with similar structure and contents in two languages. It is intended that an approximately equal number of Estonian original texts with their Swedish translations and Swedish originals with their translations in Estonian will be included. The books currently in the corpus fall into two categories - fiction and popular science (history). The two popular science books (one of them a collection of articles by several authors) were found to be particularly useful as they were also published as parallel texts. It is, in fact, sometimes difficult to say which is the original and which is the translation in these books as several of the authors are exile Estonians in Sweden who write in both languages. These texts could be an interesting source of comparison between the use of language of these bilingual Estonians and that of the native speakers.

Some texts were received on disks in machine-readable form but some had to be scanned in. The machine-readable untagged versions of the two fiction books are courtesy of the corpus of Studia Comparativa Linguarum Orbis Maris Baltici in Turku, Finland. Plans for acquiring new texts in the near future include a periodical called "Ronor" published both in Swedish and Estonian in Noarootsi, an old Swedish area in north-western Estonia. It would bring some stylistic variation into the corpus and help to balance it. There are a couple of other popular science/history books available and permission will be sought from the publishing house for their use. It is hoped that their machine-readable versions can also be acquired from the same source.

3. Tagging and aligning

For a text to become usable in the corpus it first has to be normalised: the text has to be in ASCII format, and dashes and language-specific letters such as å, õ, ä, ö and ü have to be replaced with special codes, e.g. ‘õ’ becomes &otilde;. This allows the texts to be used in different applications. Printing and other errors have been corrected by inserting tags that contain the correct version.
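The letter-replacement step can be sketched in Python as follows. The entity names are the standard ISO Latin 1 names used in SGML, but the mapping table, the function names and the code chosen for the dash are assumptions made for illustration; they are not the project's actual conversion table.

```python
# Hypothetical mapping from language-specific letters to SGML entity references.
ENTITIES = {
    "å": "&aring;", "Å": "&Aring;",
    "ä": "&auml;", "Ä": "&Auml;",
    "ö": "&ouml;", "Ö": "&Ouml;",
    "õ": "&otilde;", "Õ": "&Otilde;",
    "ü": "&uuml;", "Ü": "&Uuml;",
    "\u2014": "&mdash;",  # long dash; the code used in the corpus is assumed
}


def normalise(text: str) -> str:
    """Replace every mapped character and leave all other characters untouched."""
    return "".join(ENTITIES.get(ch, ch) for ch in text)


def is_ascii(text: str) -> bool:
    """Check that a normalised text contains only 7-bit ASCII characters."""
    return all(ord(ch) < 128 for ch in text)
```

A text passes the normalisation check once `is_ascii(normalise(text))` holds; any character outside the table would show up as a failure and could then be added to the mapping.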

Texts are tagged according to the Text Encoding Initiative (TEI) regulations with the help of the Standard Generalized Markup Language (SGML) (Sperberg-McQueen and Burnard 1994). The TEI was chosen as it seemed to be a flexible system with a wide range of tags and also because the Scandinavian corpus teams with whom we have contact are using this system, so making text exchange between different corpora possible. The texts have TEI-headers to store general information about the publication, the language and the text class as well as about the editing, tagging and aligning process the text concerned has undergone. Some tags are inserted automatically, e.g. <s>-tags denoting the borders of an S-unit (an orthographic sentence) are inserted with the help of a programme written by Knut Hofland of the Humanities Computing Centre in Bergen, Norway (Johansson, Ebeling and Hofland 1996). (This, as with several of the other programmes mentioned below, was originally designed for the English-Norwegian parallel corpus at the University of Oslo and then in some cases slightly modified for the Swedish-Estonian corpus.) Some tags are inserted semi-automatically and some still have to be put in manually.
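As an illustration of automatic S-unit tagging, the following Python sketch wraps orthographic sentences in <s>-tags. It is not Knut Hofland's programme: the boundary rule (sentence-final punctuation, whitespace, then an upper-case letter) and the function name `tag_s_units` are assumptions, and the rule will split wrongly after abbreviations.

```python
import re

# Assumed boundary rule: ., ! or ? followed by whitespace and an upper-case letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÅÄÖÕÜ])")


def tag_s_units(paragraph: str) -> str:
    """Split a paragraph at the assumed boundaries and wrap each unit in <s>...</s>."""
    units = [u.strip() for u in _BOUNDARY.split(paragraph) if u.strip()]
    return "".join(f"<s>{u}</s>" for u in units)
```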

The idea behind a parallel corpus is to align it so as to get the opportunity to compare the languages concerned. In this corpus alignment at sentence level has been used. Work is in progress to write a programme to find correspondences at word level. Firstly, an anchor list for the alignment programme had to be compiled. During this stage, as for some later work, I was assisted by Mari Aidla, BA in the Scandinavian Languages. A couple of examples from our anchor list with English translations added to them:

effekt* / efekt*, võimsus, mõjus*, tõhus* - effect, impact, result, power

efter / pärast, järel, piki, mööda, järgi, vastavalt, hiljem - after, behind, along, for, according to, by, etc.

efternamn* perekonnanim* - surname

elak* / kuri, kurj*, paha*, halb*, halv* - bad, wicked, mean, evil

A slash is used to separate Swedish and Estonian forms when there is more than one in each language.

The asterisks show that the words have been truncated to include the different declined and conjugated forms of the word as well as the compounds beginning with that word and its derivatives. Often the truncation does not take morpheme boundaries into account and is based solely on practical considerations. As can be seen, due to the complicated declension and conjugation rules of Estonian, most of the Estonian equivalents had to be truncated, and there often seem to be several Estonian equivalents to one Swedish word. One-to-one equivalence of meaning is thus less common between Swedish and Estonian than between Norwegian and English, for example. As a result the anchor list grew quite long, which in turn sometimes affects the functioning of the alignment programme.
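How truncated anchor forms can be matched against running text is sketched below in Python. The three entries are copied from the examples above; the data structure and the function names `matches` and `shared_anchors` are assumptions for illustration and do not describe the file format or the logic of the actual alignment programme.

```python
def matches(pattern: str, word: str) -> bool:
    """A trailing asterisk marks a truncated form, which is matched as a prefix."""
    word = word.lower()
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


# (Swedish forms, Estonian forms) for three entries of the anchor list.
ANCHORS = [
    (["effekt*"], ["efekt*", "võimsus", "mõjus*", "tõhus*"]),
    (["efternamn*"], ["perekonnanim*"]),
    (["elak*"], ["kuri", "kurj*", "paha*", "halb*", "halv*"]),
]


def shared_anchors(swedish_words, estonian_words) -> int:
    """Count anchor entries that have a hit on both the Swedish and the Estonian side."""
    count = 0
    for sv_forms, et_forms in ANCHORS:
        sv_hit = any(matches(p, w) for p in sv_forms for w in swedish_words)
        et_hit = any(matches(p, w) for p in et_forms for w in estonian_words)
        if sv_hit and et_hit:
            count += 1
    return count
```

Prefix matching of this kind explains why short truncated forms can over-match, which is one practical cost of a long anchor list.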

The object is to give each S-unit a link to the corresponding S-unit in the parallel text. The programme moves a window consisting of 15 S-units through the text and checks a Swedish sentence against an Estonian one, looking for shared anchor words and the length of the sentence in characters. It calculates an anchor score and the number of shared anchor words and as a result finds the best match between the sentences. The programme also looks for question and exclamation marks, proper names and markup of highlighted text because the correspondence degree of such items is high. All this results in an alignment matrix of a text window and the aligned S-units. A path (diagonal) through the matrix maximises the sum of the common anchor list items.
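The search for a best path through such a matrix can be illustrated with a small dynamic-programming sketch in Python. It is a simplification under stated assumptions: only the anchor counts are used (sentence length, punctuation and proper names are ignored), the allowed steps are limited to 1-1, 1-2 and 2-1 correspondences, and the function name `best_path` is hypothetical. It should not be read as a description of the actual alignment programme.

```python
def best_path(score):
    """Return (total, pairs) for the monotone path with the largest score sum.

    score[i][j] is the number of shared anchor items between source unit i
    and target unit j. Each pair lists the source and target indices linked.
    """
    n, m = len(score), len(score[0])
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg:
                continue
            # Assumed step set: 1-1, 1-2 and 2-1 correspondences.
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                gain = sum(score[a][b] for a in range(i, ni) for b in range(j, nj))
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[ni][nj] = (i, j)
    if best[n][m] == neg:
        raise ValueError("no path with the allowed steps")
    # Walk back from the end of the window to recover the linked units.
    pairs, cell = [], (n, m)
    while cell != (0, 0):
        prev = back[cell[0]][cell[1]]
        pairs.append((list(range(prev[0], cell[0])), list(range(prev[1], cell[1]))))
        cell = prev
    return best[n][m], pairs[::-1]
```

For a window of two source units and three target units, `best_path([[6, 1, 0], [2, 13, 5]])` links the first source unit to the first target unit and the second source unit to the remaining two, because that path collects the largest anchor total.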

An example of the alignment programme output - a matrix and aligned sentences:

115 100 40 80 88 147 31 55 97 68 57 106

1 2 3 4 5 6 7 8 9 10 11 12

------------------------------------------------

1 154 I 6 1 0 2 1 2 0 3 4 1 1 0

2 210 I 2 13 1 5 1 4 0 3 5 2 2 0

3 164 I 2 4 1 6 9 4 1 3 4 2 2 1

4 216 I 2 2 1 1 3 15 5 1 3 2 2 1

5 156 I 3 7 0 5 2 4 0 8 5 3 1 0

6 118 I 2 5 0 3 2 4 1 3 10 2 1 0

7 75 I 1 1 0 1 1 2 0 0 2 6 1 0

8 74 I 0 1 0 2 1 2 0 1 0 1 3 0

9 112 I 0 0 1 0 1 2 1 0 0 0 0 5

10 201 I 2 2 0 0 1 2 0 0 1 1 2 0

11 167 I 2 1 0 0 2 4 0 0 0 2 2 0

12 60 I 2 0 0 0 1 1 0 1 0 1 1 0

Sum=105/0.93: 1,1 2,2+3 3,4+5 4,6+7 5,8 6,9 7,10

------------------------------------------------------------------------------

2: <pb n=69> <omit desc=photo resp=tag> </div1> <div1 type=part id=HL1.20>

<head>Slottet i Haapsalu</head> <pb n=70> <p><s>Jacob De la Gardie var en av de
svenska adelsmän som verkligen fick möjlighet att ånjuta den kungliga generositeten
vad gällde förvärv och förläningar av gods i Östersjöprovinserna.</s> (HL1.20.1)

2: <pb n=69> <omit desc=photo resp=tag> </div1> <div1 type=part id=HL1T.20>

<head>Haapsalu loss</head> <pb n=70> <p><s>Jacob De la Gardie'l oli meeldiv
võimalus nautida kuninglikku suuremeelsust.</s> (HL1T.20.1)

3: <s>Ta omandas Läänemereprovintsides mõisaid.</s> (HL1T.20.2)

------------------------------------------------------------------------------

Swedish sentences are checked against the Swedish anchor word file and Estonian sentences against the Estonian anchor word file to find new items for the anchor list. In the case of very short S-units with a low number of words, problems may arise because the number of word correspondences remains low. In the case of free translation the number of word correspondences may be low and in the case of compound sentences the sentence length in characters does not match, sometimes resulting in alignment errors. The programme tests the length of the compound sentence versus the target sentence (it accepts a difference of up to 20%).

The programme outputs two aligned texts with S-units specified with attributes for ‘id’ (identifier) and ‘link’ (the identifier of the corresponding S-unit in the parallel text).

Alignment may highlight cases where a less direct translation has been chosen and, by so doing, draw our attention to differences between languages and illustrate ways in which translators think.

4. Other programmes used

The other programmes used are a simple concordancing programme and a statistics programme (giving the number of occurrences of a word, the number of sentences in which the word occurs, and the number of cases in which it occurs in two and in three sentences in a row), both written by Knut Hofland (Johansson and Ebeling 1994).
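The kind of counts produced by the statistics programme can be sketched in Python as follows. This is not Hofland's programme: the function name `word_statistics` is hypothetical, and "in a row" is interpreted here as a maximal run of exactly two (or three) consecutive sentences containing the word, which is an assumption about what the original counts.

```python
from itertools import groupby


def word_statistics(sentences, word):
    """sentences: list of token lists; word: the lower-case form to count."""
    hits = [sum(tok.lower() == word for tok in sent) for sent in sentences]
    # Lengths of maximal runs of consecutive sentences that contain the word.
    runs = [len(list(group)) for present, group in groupby(h > 0 for h in hits) if present]
    return {
        "occurrences": sum(hits),
        "sentences": sum(h > 0 for h in hits),
        "two_in_row": sum(r == 2 for r in runs),
        "three_in_row": sum(r == 3 for r in runs),
    }
```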

The use of the corpus is made easier by a search engine for Windows written by Jarle Ebeling, University of Oslo.

The morphological tagging is inserted into the texts with the help of ESTMORF, a morphology analyser for Estonian, programmed originally for the corpus of written Estonian by Heiki-Jaan Kaalep, University of Tartu (Kaalep 1996). A couple of examples of its output:

kui kui+0 //_D_ // kui+0 //_J_ // - if

Eesti Eesti+0 //_H_ sg g, sg n, // - Estonian

poolel, pool+l //_N_ sg ad, // pool+l //_S_ sg ad, // - on the side

The programme can find the stems of the lemmas or base forms, mark the borders of components in compounds and give different readings of a word, the latter subsequently making disambiguation necessary.

It is hoped that cooperation with a group of computational linguists in Helsinki will lead to the acquisition of a morphoanalyser of Swedish.

5. General statistics

Books	Paragraphs	S-units	S/P	Words	W/S	Ch/S	Ch/W
Eesti ja Rootsi (1993) (Estonia and Sweden)	433	2,005	4.63	29,652	14.79	115.26	7.79
Estland och Sverige	371	1,981	5.34	39,307	19.84	114.24	5.76
Rootsi mälestised Eestis (Lepp 1994) (Swedish relics in Estonia)	125	649	5.15	8,273	12.75	86.57	6.79
Kort vägledning till svenskminnen i Estland	142	640	4.48	11,660	18.22	99.39	5.46
Ajaloo ilu (Luik 1991) (The beauty of history)	528	2,580	4.89	35,277	13.67	-	-
Historiens förfärande skönhet (Luik 1993)	533	2,602	4.88	44,659	17.16	-	-
Hulkur Rasmus (Lindgren 1965) (Rasmus goes travelling)	1,626	8,161	5.02	38,607	4.73	25.68	5.43
Rasmus på luffen (Lindgren 1986)	1,610	7,254	4.51	45,293	6.24	27.84	4.46

193,424

Notes:

S/P - S-units per paragraph

W/S - words per S-unit

Ch/S - characters per sentence

Ch/W - characters per word

As can be seen from the table, both in the case of Swedish originals and translations into Swedish the number of words exceeds that of the Estonian equivalent. It is quite logical then that the average Swedish sentence contains more words than the Estonian one, but the percentage is surprisingly high - Swedish sentences seem to consist of 33.62% more words than Estonian ones, no matter whether originals or translations. The Estonian words appear to be longer than the Swedish ones by 27.11%. A comparison of the number of S-units shows the tendency of the Estonian texts to contain slightly more sentences. This fact leads us logically to the statement made in the next section.
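The two percentages can be checked against the table with a short Python calculation. The values are copied from the W/S and Ch/W columns; treating each percentage as the mean of the per-book ratios is an assumption about how the figures were derived, although it gives results very close to those quoted (about 33.6 and 27.1).

```python
def mean_ratio_excess(pairs):
    """Mean of first/second over all pairs, expressed as a percentage above 100."""
    ratios = [first / second for first, second in pairs]
    return (sum(ratios) / len(ratios) - 1) * 100


# (Swedish W/S, Estonian W/S) for the four book pairs in the table.
words_per_s_unit = [(19.84, 14.79), (18.22, 12.75), (17.16, 13.67), (6.24, 4.73)]

# (Estonian Ch/W, Swedish Ch/W); the Luik pair has no character counts.
chars_per_word = [(7.79, 5.76), (6.79, 5.46), (5.43, 4.46)]

swedish_extra_words = mean_ratio_excess(words_per_s_unit)
estonian_extra_length = mean_ratio_excess(chars_per_word)
```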

6. Sentence division

The number of cases where one Swedish S-unit is equivalent to two Estonian ones is approximately 2.2 times greater than the other way round. There are also several cases of one Swedish S-unit being equivalent to three Estonian S-units, but not the other way round. Swedish seems to resort to a coordinate or subordinate clause much more often whereas Estonian generally begins a new sentence. Swedish also sometimes has a comma or a semi-colon where there is a full stop in the Estonian text. The most common ways of linking clauses in Swedish (English equivalents are given in brackets):

1 Swedish sentence - 2 Estonian sentences

‘och’ (and); ‘som’ (who/which/that); ‘men’ (but); ‘då’ (when/as/since); ‘bl.a.’ (amongst others); ‘innan’ (before); ‘förmodligen’ (probably/presumably); a dash
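Counts of one-to-two and two-to-one correspondences can be read off the link lines printed by the alignment programme. The Python sketch below parses a line in the format of the example output in section 3; it assumes that the first number of each item is the Swedish S-unit and the second the Estonian one, and the function names are hypothetical.

```python
from collections import Counter


def parse_links(line):
    """Turn '1,1 2,2+3' into [(['1'], ['1']), (['2'], ['2', '3'])]."""
    pairs = []
    for item in line.split():
        swedish, estonian = item.split(",")
        pairs.append((swedish.split("+"), estonian.split("+")))
    return pairs


def alignment_types(pairs):
    """Count links by (number of Swedish units, number of Estonian units)."""
    return Counter((len(sv), len(et)) for sv, et in pairs)


counts = alignment_types(parse_links("1,1 2,2+3 3,4+5 4,6+7 5,8 6,9 7,10"))
```

Summing such counters over all windows of a text pair gives the totals from which a ratio between the (1, 2) and (2, 1) types can be computed.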

7. Concordances

One of the first things even a limited corpus can be used for is making concordances. As the corpus project runs at the same time as the compilation of a major Swedish-Estonian dictionary, there is a welcome possibility to use the corpus as an aid for finding natural (not specially made up) examples and sufficient proof for giving an Estonian word or phrase as an equivalent of a Swedish one.

Up to now concordances have been made on:

* Swedish ‘ta’, ‘tar’, ‘tog’, ‘tagit’ and Estonian ‘võt*’ (to take)

* Swedish ‘fick’, past tense of ‘att få’ and Estonian ‘sai’, 3rd person singular, past tense of ‘saama’ (to get, to become, to receive)

* Swedish ‘var’, past tense of ‘att vara’ and Estonian ‘olid’, 3rd person plural, past tense of ‘olema’ (to be, to have)

* conjunctions - Estonian ‘ja’, ‘ning’, ‘kuid’, ‘ega’, ‘ehk’, ‘aga’ and Swedish ‘och’, ‘men’, ‘eller’ (and, as well as, but, or, nor, though)

As an example of the study of word correspondence I have looked at Swedish ‘ta’ and its forms and the Estonian equivalent ‘võt*’. It appears that in only 52.6% of cases ‘ta’ or one of its forms was translated as ‘võt*’, and in only 61.2% of cases ‘võt*’ was translated as a form of ‘ta’. This may be due to the tendency of ‘ta’ to occur in collocations and idioms which cannot be translated directly.
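A correspondence rate of this kind can be estimated from aligned S-units, as in the following Python sketch. It works at the level of S-units rather than individual occurrences, the function name `correspondence_rate` is hypothetical, and the three sentence pairs in the usage note are invented toy examples, not corpus data.

```python
TA_FORMS = {"ta", "tar", "tog", "tagit"}


def correspondence_rate(aligned_pairs):
    """Percentage of Swedish S-units containing a form of 'ta' whose Estonian
    counterpart contains a word beginning with 'võt'."""
    with_ta = [(sv, et) for sv, et in aligned_pairs
               if any(w.lower() in TA_FORMS for w in sv)]
    if not with_ta:
        return 0.0
    hits = sum(any(w.lower().startswith("võt") for w in et) for _, et in with_ta)
    return 100 * hits / len(with_ta)


# Invented toy pairs: (Swedish tokens, Estonian tokens).
toy_pairs = [
    (["Han", "tog", "boken"], ["Ta", "võttis", "raamatu"]),
    (["Hon", "tar", "en", "promenad"], ["Ta", "jalutab"]),
    (["Det", "regnar"], ["Sajab"]),
]
toy_rate = correspondence_rate(toy_pairs)
```

In the toy data the collocation ‘tar en promenad’ is rendered by a single Estonian verb, so only one of the two Swedish S-units containing a form of ‘ta’ has a ‘võt*’ counterpart.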

8. Applications

The corpus will progressively be used even more in the Swedish-Estonian dictionary project and probably in future lexicographic projects intended to be carried out at the Department of Scandinavian Languages. I have planned to use it in teaching the Swedish students lexicology and lexicography and it will certainly provide a good research basis for many linguistics students. Depending on which texts are going to be added to the corpus, different language variations can be studied.

References

Eesti ja Rootsi / Estland och Sverige 1993. Ed. by Anne-Marie Dahlberg and Toomas Taimla. Tallinn: Huma.

Johansson, Stig and Jarle Ebeling 1994. The English-Norwegian Parallel Corpus: Introduction and Applications. Paper submitted to The XXVIII International Conference on Cross-Language Studies and Contrastive Linguistics. 15-17 December 1994, Rydzyna, Poland.

Johansson, Stig, Jarle Ebeling and Knut Hofland 1996. Coding and aligning the English-Norwegian parallel corpus. In Languages in Contrast. Papers from a Symposium on Text-based Cross-linguistic Studies, ed. by Karin Aijmer, Bengt Altenberg and Mats Johansson. Lund Studies in English 88, ed. by Sven Bäckman and Jan Svartvik, pp. 87-112. Lund University Press.

Kaalep, Heiki-Jaan 1996. ESTMORF: A Morphological Analyser for Estonian. In Estonian in the Changing World. Ed. by Haldur Õim. Tartu.

Lepp, Hans 1994. Rootsi mälestised Eestis. Lühike teejuht / Kort vägledning till svenskminnen i Estland. Tallinn: Huma.

Lindgren, Astrid 1965. Hulkur Rasmus. Translated by V. Beekman. Tallinn: Eesti Raamat.

Lindgren, Astrid 1986. Rasmus på luffen. Stockholm: Rabén & Sjögren.

Luik, Viivi 1991. Ajaloo ilu. Tallinn: Eesti Raamat.

Luik, Viivi 1993. Historiens förfärande skönhet. Translated by I. Iliste, B. Göranson. Stockholm: Natur och Kultur.

Sperberg-McQueen, Michael C.M., and Lou Burnard (eds) 1994. Guidelines for Electronic Text Encoding and Interchange. TEI P3. Chicago and Oxford: Association for Computers and the Humanities/Association for Computational Linguistics/Association for Literary and Linguistic Computing (electronic version).