The PROIEL Treebank is a treebank of ancient Indo-European languages, including Latin and Ancient Greek. It uses a refined version of dependency grammar and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. On this site you will find our official, versioned releases of the treebank and pointers to further information.
The PROIEL Treebank is one of three treebanks that use the same annotation system, follow the same principles and are available under the same license. The PROIEL Treebank covers Ancient Greek and Latin, as well as the translations of the New Testament into Gothic, Classical Armenian and Old Church Slavonic. The TOROT Treebank covers Old Church Slavonic, Old Russian and Middle Russian, while the ISWOC Treebank includes texts in Old English, Old French, Portuguese and Spanish. The complete collection currently has 928,185 tokens, all of which have been manually annotated with morphological and syntactic analyses. Parts of the treebank also have information-structure annotation, and the New Testament texts include text alignment.
If you use the treebank, please cite as:
Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34.
See also the following articles for further details:
Dag T. T. Haug, Marius L. Jøhndal, Hanne M. Eckhoff, Mari Johanne Hertzenberg and Angelika Müth. 2009. 'Computational and Linguistic Issues in Designing a Syntactically Annotated Parallel Corpus of Indo-European Languages'. Traitement automatique des langues 50 (2): 17-45.
Dag T. T. Haug, Hanne M. Eckhoff, Marek Majer and Eirik Welo. 2009. 'Breaking down and putting back together: analysis and synthesis of New Testament Greek'. Journal of Greek Linguistics 9 (1): 56-92.
Hanne Eckhoff, Kristin Bech, Gerlof Bouma, Kristine Eide, Dag Haug, Odd Einar Haugen and Marius Jøhndal. 2017. 'The PROIEL treebank family: a standard for early attestations of Indo-European languages'. Language Resources and Evaluation.
The following texts are currently included in the PROIEL Treebank:
|The Greek New Testament||Ancient Greek|
|Herodotus, Histories||Ancient Greek|
|Sphrantzes, Chronicles||Ancient Greek|
|Caesar, The Gallic War||Latin|
|Cicero, Letters to Atticus||Latin|
|The Armenian New Testament||Classical Armenian|
|The Gothic Bible||Gothic|
|Codex Marianus||Old Church Slavonic|
Please see the data files in the release distribution for complete contributor details and editorial notes.
The treebank was started as part of the research project Pragmatic Resources in Old Indo-European Languages, which was financed by the Norwegian Research Council. It originally comprised the New Testament in Ancient Greek and its translations into Latin, Old Church Slavonic, Gothic and Classical Armenian. The treebank has since been expanded to include ancient Indo-European texts in general and has spawned the TOROT and ISWOC treebanks, which are complementary to the PROIEL Treebank.
We are constantly expanding the treebank. The following texts are in the pipeline and will be included in an upcoming release:
|Cicero, De officiis||Latin|
|Palladius, Opus agriculturae||Latin|
|Plautus, opera omnia||Latin|
|Terence, opera omnia||Latin|
The morphosyntactic annotation scheme is described in the document PROIEL Guidelines for Annotation.
Our releases contain the treebank in our own PROIEL XML format. PROIEL XML is the authoritative format for PROIEL-style treebanks and the only one that provides access to all of our annotation, but for ease of use we also include the treebank as CoNLL-X and CoNLL-U files. We have a command-line utility that can be used to convert PROIEL XML to various other formats (see the documentation for examples), including formats used for training taggers. For more complex tasks we have a Ruby library, which is quite well documented.
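For a quick look at the CoNLL-U files without installing anything, a few lines of Python are enough. The following is a minimal sketch, not part of our tooling: it assumes only the standard ten tab-separated CoNLL-U columns, and the function name `read_conllu` and the two sample sentences are made up for illustration rather than copied from a release.

```python
from collections import Counter

# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Illustrative input only: hand-written for this sketch, NOT taken from the treebank.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tGallia\tGallia\tPROPN\t_\t_\t3\tnsubj:pass\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t3\taux:pass\t_\t_\n"
    "3\tdivisa\tdivido\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = example-2\n"
    "1\tCaesar\tCaesar\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvenit\tvenio\tVERB\t_\t_\t0\troot\t_\t_\n"
)


def read_conllu(lines):
    """Yield each sentence as a list of dicts, one dict per token."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines carry metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


sentences = list(read_conllu(SAMPLE.splitlines()))
token_count = sum(len(s) for s in sentences)
upos_counts = Counter(tok["upos"] for s in sentences for tok in s)
root_forms = [tok["form"] for s in sentences for tok in s if tok["head"] == "0"]

print(token_count, "tokens in", len(sentences), "sentences")
print(upos_counts.most_common())
print("roots:", root_forms)
```

To run this on a real file, pass an open file handle (opened with `encoding="utf-8"`) to `read_conllu` instead of `SAMPLE.splitlines()`. Keep in mind that the CoNLL exports do not carry all of the annotation; anything that depends on the full annotation should start from PROIEL XML.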