The PROIEL Treebank Framework

As of December 2016, this document is an evolving draft. The information here is correct and up to date, but far from complete.

The general format is proiel followed by a command, any options and one or more filenames:

proiel info -V caes-gal.xml cic-att.xml

Most commands also require sub-commands:

proiel convert conll-x -V caes-gal.xml

The filename arguments are the treebank files to process. All commands accept plain PROIEL XML files or gzipped PROIEL XML files:

proiel convert conll-x caes-gal.xml
proiel convert conll-x caes-gal.xml.gz
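
If you post-process treebank files in your own scripts, you may want the same convenience. The following is a minimal Python sketch, not part of the proiel tool; the helper name open_treebank is hypothetical. It decides between plain and gzipped input by looking at the two-byte gzip signature rather than the file extension.

```python
import gzip


def open_treebank(path):
    """Open a plain or gzip-compressed file for reading as UTF-8 text.

    Hypothetical helper for illustration; it is not provided by proiel.
    """
    # Every gzip stream starts with the two bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

Both branches return a text-mode file object, so calling code does not need to know which kind of file it was given.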

Validating

PROIEL XML can be validated using:

proiel validate input.xml

This peeks at the file to determine which version of PROIEL XML it uses, validates it against the corresponding XML schema, and then runs a number of integrity checks. These checks verify that cross-references between objects are valid and that the annotation is consistent with the annotation schema.
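
To illustrate the first step, here is a minimal Python sketch of how a script could read the schema version without parsing the whole file. It is not how proiel validate itself is implemented; the function name peek_schema_version is hypothetical, and the sketch assumes the version is stored in the schema-version attribute of the root element, as in the PROIEL XML sample shown later in this document.

```python
import io
import xml.etree.ElementTree as ET


def peek_schema_version(source):
    """Return the root element's schema-version attribute, or None.

    `source` is a file name or a binary file object. Parsing stops at the
    first start tag, so the rest of the file is never read into a tree.
    """
    for _event, elem in ET.iterparse(source, events=("start",)):
        # The first "start" event always belongs to the root element.
        return elem.get("schema-version")
    return None


# Made-up minimal document, used only to demonstrate the call.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<proiel export-time="2014-12-19T12:44:28+01:00" schema-version="2.0">'
    b"<annotation/></proiel>"
)
print(peek_schema_version(io.BytesIO(sample)))  # prints 2.0
```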

If you only want to validate the file against the XML schema, you can use a tool like xmllint:

xmllint --nonet --noout --path path_to_schema_files --schema path_to_schema_files/proiel.xsd input.xml

Converting to CoNLL-X

Conversion to CoNLL-X is done using:

proiel convert conll-x input.xml > output.conll

Official releases of the PROIEL treebank include the CoNLL-X format and can be downloaded from the PROIEL treebank.
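
The converted file can be read back with a few lines of code. The Python sketch below is not part of proiel; it relies only on the general CoNLL-X layout (ten tab-separated columns per token, one blank line between sentences). The function name read_conllx and the dictionary keys are illustrative choices.

```python
# Column names as defined by the CoNLL-X shared task format.
CONLLX_FIELDS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def read_conllx(lines):
    """Yield each sentence as a list of dicts, one dict per token line."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(CONLLX_FIELDS, line.split("\t"))))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence
```

The function accepts any iterable of lines, so an open file handle for output.conll can be passed in directly.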

Converting to CoNLL-U

Conversion to CoNLL-U can be done using:

proiel convert conll-u input.xml > output.conllu

Note that the conversion is experimental and the output is likely to evolve as the Universal Dependencies project matures.

Curated versions of the PROIEL treebank in CoNLL-U format can be downloaded from the Universal Dependencies project.
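
As a quick sanity check on a converted file, you can tally the universal part-of-speech tags. The following Python sketch is not part of proiel and assumes only the general CoNLL-U layout: comment lines start with #, the fourth column holds the UPOS tag, and IDs containing - or . mark multiword-token ranges and empty nodes rather than ordinary words. The function name upos_counts is hypothetical.

```python
from collections import Counter


def upos_counts(lines):
    """Count UPOS tags over ordinary word lines of a CoNLL-U stream."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        counts[cols[3]] += 1
    return counts
```

Given an open file handle for output.conllu, upos_counts(handle).most_common() lists the tags from most to least frequent.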

Merging treebank files

Several treebank files can be merged into one treebank file by using proiel convert proielxml with multiple input files:

proiel convert proielxml caes-gal.xml cic-att.xml

The result will be a PROIEL XML file with multiple source elements:

<?xml version="1.0" encoding="UTF-8"?>
<proiel export-time="2014-12-19T12:44:28+01:00" schema-version="2.0">
  <annotation>
     ...
  </annotation>
  <source id="caes-gal" language="lat">
     ...
  </source>
  <source id="cic-att" language="lat">
     ...
  </source>
</proiel>

The treebanks to be merged must all use the same schema version and the same tagset.
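
To confirm what a merged file contains, you can list its source elements. This is a minimal Python sketch, not part of proiel; it assumes the structure shown above, with source elements as direct children of the root and no XML namespace. The function name list_sources is hypothetical.

```python
import xml.etree.ElementTree as ET


def list_sources(source):
    """Return (id, language) pairs for every <source> under the root.

    `source` is a file name or a binary file object. The whole file is
    parsed, which may be slow for very large treebanks.
    """
    root = ET.parse(source).getroot()
    return [(s.get("id"), s.get("language")) for s in root.findall("source")]
```

For the merged file above, this would be expected to return the pairs for caes-gal and cic-att, both with language lat.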

Removing information from treebank files

Information can be removed from treebank files by using proiel convert proielxml with options like --remove-information-structure. Use proiel convert proielxml --help for a full list.

Searching for text

Simple searches can be performed using proiel grep followed by a regular expression. This searches the text, that is, the form attribute on tokens together with any presentation_before and presentation_after attributes on tokens, sentences and divs. It returns any text that matches the regular expression, as in this example:

$ proiel grep 'pel' caes-gal.xml
Caes. Gal. 1.1.1 (ID = 52548) Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
Caes. Gal. 1.3.3 (ID = 52570) In eo itinere persuadet Castico, Catamantaloedis filio, Sequano, cuius pater regnum in Sequanis multos annos obtinuerat et a senatu populi Romani amicus appellatus erat, ut regnum in civitate sua occuparet, quod pater ante habuerit;
...
$ proiel grep '^pel' caes-gal.xml
Caes. Gal. 3.13.6 (ID = 53210) pelles pro velis alutaeque tenuiter confectae, sive propter inopiam lini atque eius usus inscientiam, sive eo, quod est magis veri simile, quod tantas tempestates Oceani tantosque impetus ventorum sustineri ac tanta onera navium regi velis non satis commode posse arbitrabantur.

The regular expression is applied to one sentence at a time, so the anchors ^ and $ match the beginning and end of the sentence.

To apply a regular expression to each individual token instead, use the --level token option:

$ proiel grep 'pel' --level token caes-gal.xml
Caes. Gal. 1.1.1 (ID = 680740) appellantur.
Caes. Gal. 1.3.3 (ID = 681128) appellatus
Caes. Gal. 1.12.4 (ID = 682300) appellabatur
...
$ proiel grep '^pel' --level token caes-gal.xml
Caes. Gal. 1.31.11 (ID = 685232) pellerentur
Caes. Gal. 2.33.2 (ID = 693103) pellibus
Caes. Gal. 3.13.6 (ID = 852327) pelles
...

Matching is case-sensitive by default. Use the -i option for case-insensitive matching:

$ proiel grep 'Gal' --level token caes-gal.xml
Caes. Gal. 1.1.1 (ID = 680720) Gallia
Caes. Gal. 1.1.1 (ID = 680739) Galli
Caes. Gal. 1.1.2 (ID = 680749) Gallos
...
$ proiel grep 'Gal' --level token -i caes-gal.xml
...
Caes. Gal. 1.17.4 (ID = 761727) Gallia
Caes. Gal. 1.18.3 (ID = 683173) vectigalia
Caes. Gal. 1.19.3 (ID = 756644) Galliae
...