Syllabified corpora

The syllabified corpora are:

Text class                   Number of words   Origin
Fiction (Estonian authors)   104 000
Newspaper texts              111 000
Oral speech                  100 000
Chatrooms                     94 000
CHILDES caretaker language   400 000

Syllabification was a two-stage process:

  1. Mark word boundaries in compound words, using the morphological analyser by Filosoft with the command-line flag -a (meaning that the word is not lemmatised).
  2. Syllabify with the hfst-xfst transducer script silbita.xfscript.

The character encoding is UTF-8. An underscore "_" marks word boundaries in compound words; a dot "." marks syllable boundaries.
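
The marker convention can be read back with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, the example word is an illustration rather than a line taken from the corpora, and it assumes each token is a single word with no sentence punctuation attached (a sentence-final full stop would otherwise be mistaken for a syllable boundary).

```python
def split_syllables(token: str) -> list[list[str]]:
    """Split one syllabified token into compound components and syllables.

    "_" separates the components of a compound word,
    "." separates syllables inside a component.
    Empty pieces (e.g. from a stray leading marker) are dropped.
    """
    return [
        [syllable for syllable in component.split(".") if syllable]
        for component in token.split("_")
        if component
    ]


def strip_markers(token: str) -> str:
    """Recover the plain word by removing both boundary markers."""
    return token.replace("_", "").replace(".", "")


def count_syllables(token: str) -> int:
    """Count syllables across all components of the token."""
    return sum(len(component) for component in split_syllables(token))


if __name__ == "__main__":
    # Illustrative token only (assumed, not copied from the corpora).
    example = "raa.ma.tu_ko.gu"
    print(split_syllables(example))
    print(strip_markers(example))
    print(count_syllables(example))
```

When reading the corpus files, open them with encoding="utf-8" so that letters such as õ, ä, ö and ü are decoded correctly.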

CV structures found in the corpora are

