A discussion of how treebanks are published and used in Beyond Translation.
Beyond Translation can publish treebank data within the Perseus reading environment. (To see this treebank in Perseus 6.0, see here.)
Treebanks are texts that have rich linguistic annotation that captures the syntactic relations between words in a text. These relations, when visualized, look like inverted trees but the term more properly describes the mathematical structure of the underlying data. Each word can have one ancestor but many different descendants. Trees are a subset of the more general class of graphs and this facilitates certain kinds of analysis. We could (and at some point will) augment the base structure by adding secondary dependencies (like the dotted lines in an organizational chart) but for now the treebanks in Perseus permit one ancestor for each token.
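The one-ancestor constraint can be made concrete with a small sketch. The Python below is an illustration, not code from Beyond Translation: the token ids and the `heads` mapping are invented for the example, using four words from the opening of the Iliad and the dependencies described in this article. Each token stores exactly one head (0 stands for the root of the sentence), and the descendants are derived by inverting that mapping.

```python
from collections import defaultdict

# Hypothetical token ids for four words from the opening of the Iliad.
forms = {1: "mênin", 2: "aeide", 3: "Achilêos", 4: "oulomenên"}

# One head per token (0 marks the root). A dict cannot hold two heads
# for the same key, which mirrors the "one ancestor" constraint.
heads = {1: 2, 2: 0, 3: 1, 4: 1}

def children_of(heads):
    """Invert the head mapping: head id -> list of dependent ids."""
    children = defaultdict(list)
    for token, head in sorted(heads.items()):
        children[head].append(token)
    return dict(children)

def is_tree(heads):
    """True if there is one root and every token reaches it without a cycle."""
    roots = [t for t, h in heads.items() if h == 0]
    if len(roots) != 1:
        return False
    for token in heads:
        seen = set()
        while token != 0:
            if token in seen or token not in heads:
                return False
            seen.add(token)
            token = heads[token]
    return True

children = children_of(heads)
print(is_tree(heads))                   # checks the sample analysis
print([forms[t] for t in children[1]])  # words that depend on mênin
```

Adding secondary dependencies would mean storing more than one head per token, at which point the structure becomes a general graph and a check like `is_tree` would no longer apply.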
The vast majority of treebanks are used to support data-driven linguistic analysis. Treebanks allow us to identify how often a given verb takes an object in the dative vs. accusative, what adjectives go with what nouns, and where higher level linguistic structures (e.g., genitive absolutes or future less vivid conditionals in Greek) appear.
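A query of this kind can be sketched in a few lines of Python. The records below are toy data invented for illustration (only the tags for mênin and aeide come from this article), the `PRED` label for the main verb is an assumption, and the function name `object_cases` is hypothetical. The sketch counts the case of every word labelled OBJ whose head has a given lemma.

```python
from collections import Counter

# Toy records, invented for illustration: (id, lemma, postag, relation, head).
# A real query would read these fields from treebank files instead.
sentences = [
    [(1, "mênis", "n-s---fa-", "OBJ", 2),
     (2, "aeidô", "v2spma---", "PRED", 0)],
    [(1, "didômi", "v3spia---", "PRED", 0),
     (2, "dôron", "n-s---na-", "OBJ", 1),
     (3, "theos", "n-s---md-", "OBJ", 1)],
]

CASE_POSITION = 7  # zero-based index of the case letter in a 9-character tag

def object_cases(sentences, verb_lemma):
    """Count the case letters of OBJ dependents of the given verb lemma."""
    counts = Counter()
    for sent in sentences:
        lemma_by_id = {tok[0]: tok[1] for tok in sent}
        for _id, _lemma, postag, relation, head in sent:
            if relation == "OBJ" and lemma_by_id.get(head) == verb_lemma:
                counts[postag[CASE_POSITION]] += 1
    return counts

print(object_cases(sentences, "didômi"))
```

Run over a full treebank rather than two toy sentences, the same loop would give the relative frequency of dative and accusative objects for any verb.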
Beyond Translation, however, offers treebank data to support readers who wish to understand how the words in a sentence relate to each other.
In the bottom half, the reader sees a traditional graphic visualization of the syntax, with the object (mênis, “godlike wrath”) linking upwards to the verb (aeide, “sing!”) on which it depends. The two words that depend upon mênis (Achilêos, “of Achilles,” and oulomenên, “sociopathic”) are listed below it. The upper half of the figure shows how colors are used to represent the ancestor of a word (in blue) and its descendants (in green). This linear representation is particularly helpful when a sentence has a very large and unwieldy tree.
By default, the treebank viewer displays the translation for the given sentence where this is available. Adding sentence alignments can be a non-trivial task because translations will often break up longer sentences in the original or combine shorter ones, but these alignments make it much easier for users who do not know the language to search the original and then look for the search term in the translation.
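The linear, color-coded view can be approximated as follows. This is a sketch under stated assumptions, not the viewer's actual code: the token ids are invented, the function name `color_line` is hypothetical, and "ancestor" and "descendants" are taken here to mean the immediate head and the immediate dependents of the selected word.

```python
# Hypothetical ids, in reading order; head 0 marks the root of the sentence.
tokens = [
    (1, "mênin", 2),
    (2, "aeide", 0),
    (3, "Achilêos", 1),
    (4, "oulomenên", 1),
]

def color_line(tokens, selected_id):
    """Label each word relative to the selected one, in reading order."""
    head_of = {tid: head for tid, _form, head in tokens}
    labelled = []
    for tid, form, head in tokens:
        if tid == selected_id:
            color = "selected"
        elif tid == head_of[selected_id]:
            color = "blue"    # the word the selection depends on
        elif head == selected_id:
            color = "green"   # a word that depends on the selection
        else:
            color = "plain"
        labelled.append((form, color))
    return labelled

for form, color in color_line(tokens, 1):
    print(f"{form}: {color}")
```

Because the labels are computed per word rather than drawn as a tree, the output keeps the original word order however deep the tree becomes.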
Initial treebank data in Beyond Translation follows the Perseus annotation scheme:
The syntactic relationship of a word in the sentence, as represented by a range of annotations (e.g., OBJ for object, ADV for adverbial, ATR for attribute).
The lemma (normalized dictionary form) of each inflected form (e.g., the imperative “sing!” in the first line of Iliad appears as aeide but its dictionary form is aeidô).
The morphological tag defines basic grammatical categories of a word. The tag for mênin is “n-s---fa-”: noun (n), singular (s), feminine (f) and accusative (a); the tag for aeide is “v2spma---”: verb (v), 2nd person (2), singular (s), present (p), imperative (m), active (a). Each position in this encoding has a fixed function, and thus v2spmm would designate a middle imperative.
A serial number for each word in the sentence (which is not pictured in Figure 3 but is used to identify which words depend upon which).
We can view these features in Beyond Translation.
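The four kinds of information listed above can be read and decoded with a short script. What follows is a minimal sketch under stated assumptions: the XML attribute names (`id`, `form`, `lemma`, `postag`, `relation`, `head`) and the `PRED` label are illustrative rather than quoted from the Perseus files, the position names in `POSITIONS` are inferred from the two tags discussed above (with a ninth slot assumed to hold degree), and `CODES` spells out only the letters mentioned in this article.

```python
import xml.etree.ElementTree as ET

# Assumed serialization: the attribute names are illustrative.
SAMPLE = """
<sentence id="1">
  <word id="1" form="mênin" lemma="mênis" postag="n-s---fa-" relation="OBJ" head="2"/>
  <word id="2" form="aeide" lemma="aeidô" postag="v2spma---" relation="PRED" head="0"/>
</sentence>
"""

POSITIONS = ["part of speech", "person", "number", "tense", "mood",
             "voice", "gender", "case", "degree"]

# Only the codes mentioned in the text are spelled out here.
CODES = {
    "part of speech": {"n": "noun", "v": "verb"},
    "person": {"2": "2nd person"},
    "number": {"s": "singular"},
    "tense": {"p": "present"},
    "mood": {"m": "imperative"},
    "voice": {"a": "active", "m": "middle"},
    "gender": {"f": "feminine"},
    "case": {"a": "accusative"},
}

def decode(postag):
    """Return {category: value} for every position that is not '-'."""
    padded = postag.ljust(len(POSITIONS), "-")
    result = {}
    for name, code in zip(POSITIONS, padded):
        if code != "-":
            # Fall back to the raw letter when no expansion is listed.
            result[name] = CODES.get(name, {}).get(code, code)
    return result

for word in ET.fromstring(SAMPLE).iter("word"):
    print(word.get("id"), word.get("form"), word.get("lemma"),
          word.get("relation"), "head:", word.get("head"),
          decode(word.get("postag")))
```

The same letter can mean different things in different positions (m is "imperative" in the mood slot and "middle" in the voice slot), which is why the lookup is keyed by position first.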
Once we have broken a text up into its constituent tokens, we can also add other classes of annotation.
It is not difficult to learn the Greek alphabet (one class can be enough for students to get started), but the ability to generate a transliteration for Greek is convenient and easy. (For languages in writing systems such as Arabic script or Hebrew, the task is more challenging because short vowels are not normally included in a text.) The glosses in this case are derived from a list developed by Helma Dik at Chicago (although we hope to replace them with more precise glosses from Cunliffe’s Homeric Lexicon). The grammatical tags point to a compact annotation grammar developed by Farnoosh Shamsian (for which see this separate section).
One crucial feature that we have begun to implement is to give credit for fine-grained contributions. We have an enormous amount of work before us as we build a new digital infrastructure for the study of the past. We need to exploit automated systems but we also have many opportunities accessible to students and citizen scientists. We need to recognize these contributions and not just refer, in general terms, to “student help.” We have a new, more decentralized, less hierarchical mode of intellectual production and proper credit is a key element of this.
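To show why transliterating Greek is a simple task, here is a minimal sketch. It is not the routine used in Beyond Translation: the letter table follows the romanization used in this article (ê for eta, ô for omega), the function name `transliterate` is hypothetical, and the sketch simply discards accents, breathings and iota subscripts, so a rough breathing is not rendered as h.

```python
import unicodedata

# Letter mapping following the romanization used in this article;
# other conventions are equally possible.
LETTERS = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "ê", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "ô",
}

def transliterate(text):
    """Lowercase, strip diacritics, and map each Greek letter to Latin."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop accents, breathings and iota subscripts
        out.append(LETTERS.get(ch, ch))
    return "".join(out)

print(transliterate("μῆνιν ἄειδε"))
```

No comparable letter-by-letter table exists for unvocalized Arabic or Hebrew text, since the short vowels that a transliteration needs are not present in the input.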
Single, expert annotators have created many of the more recent treebanks, but we are able to manage credits for more complex production schemes. Two annotators analyzed each sentence in the Homer treebanks independently. An experienced annotator then examined where they differed and resolved those differences. We thus have three credits for each sentence. More than a decade has passed since the Homer Treebanks were completed and the credits have remained in the XML serialization of the treebank data on GitHub. Now, we can include not only the data but also the credits in the new version of Perseus that Beyond Translation is developing.
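The double-annotation workflow can be sketched as a comparison of two analyses followed by a reviewer's decision. Everything in the example is invented for illustration: the two analyses, the placeholder names in `credits`, and the function names `disagreements` and `adjudicate`. It is not the procedure or the data format actually used for the Homer treebanks.

```python
# Invented example: two independent analyses of one sentence, keyed by
# token id, each giving (head, relation). Names are placeholders.
annotator_a = {1: (2, "OBJ"), 2: (0, "PRED"), 3: (1, "ATR"), 4: (1, "ATR")}
annotator_b = {1: (2, "OBJ"), 2: (0, "PRED"), 3: (2, "ATR"), 4: (1, "ATR")}

def disagreements(first, second):
    """Token ids on which the two analyses differ in head or relation."""
    return sorted(t for t in first.keys() | second.keys()
                  if first.get(t) != second.get(t))

def adjudicate(first, second, decisions):
    """Keep agreed analyses; take the reviewer's decision where they differ."""
    final = {}
    for token in sorted(first.keys() | second.keys()):
        if first.get(token) == second.get(token):
            final[token] = first[token]
        else:
            final[token] = decisions[token]
    return final

# Three credits per sentence: two annotators and one reviewer.
credits = {"annotators": ["Annotator A", "Annotator B"],
           "reviewer": "Reviewer C"}

to_review = disagreements(annotator_a, annotator_b)
final = adjudicate(annotator_a, annotator_b, {3: (1, "ATR")})
print(to_review, final[3], credits)
```

Keeping the `credits` record alongside each sentence, rather than once per file, is what allows the contribution of every annotator to be displayed at the level of the individual sentence.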
Support for treebanks in Perseus has been a long time coming — we first proposed developing a treebank for Greek in 2002. Work on a Latin treebank began after David Bamman received initial support from the National Science Foundation in 2007. Support from the Alpheios Project allowed us to begin work on a Greek treebank and by 2010 the first major component of that effort, a complete treebank for the Iliad and Odyssey, with more than 200,000 tokens, was complete.
In the decade that followed, the amount of manually treebanked Greek expanded to more than 1.4 million words, with major contributions not only from Francesco Mambrini and Giuseppe Celano working with Perseus at both Tufts and Leipzig, but also, using the Arethusa Annotation environment maintained by the Perseids Project, from Vanessa Gorman of the University of Nebraska, and Alek Keersmaekers, Toon van Hal and their students at Leuven. Dag Haug and his colleagues at Oslo in the Proiel Project also contributed more than 200,000 words of treebanked Greek in a differing, but largely compatible, annotation scheme.
The Perseus treebanks have been used from very early stages of development to train models for automatic treebanking of larger corpora. The curated treebanks are, however, now sufficiently large that they can provide training data for fairly high-performance automatic treebanking, and we now have 10 million automatically treebanked words produced by the Leuven team that are available on GitHub.
The future for the Greek and Latin treebanks lies in the Universal Dependency Framework. When work began on the Perseus Treebanks, David Bamman and Marco Passarotti, developer of the Index Thomisticus Treebank, collaborated to develop compatible annotation schemes and they found the tagset of the Prague Dependency Grammar to be the most suitable for highly inflected languages such as Ancient Greek and Latin. In the meantime, the Universal Dependency (UD) Framework has emerged. Perseus and Proiel each converted materials from their Greek and Latin treebanks to UD in 2014 and these are now being used to generate models for parsing with two major Natural Language Processing pipelines, Stanza and spaCy.
A shift to UD is necessary for at least two reasons. First, where the effort of becoming familiar with the Perseus tagset prepares users to work with data from Perseus and the Prague Dependency Treebank, an understanding of UD prepares users to work with treebanks in more than 100 languages. Learning Greek and Latin with treebanks becomes a gateway that opens up much broader linguistic exploration. Second, the UD community goes well beyond those annotating Greek and Latin. At present, we need to manage too much of our own linguistic infrastructure. By shifting to UD, we can begin to work with annotation schemes and search mechanisms that others have built. The shift would greatly enhance the appeal and sustainability of Greek and Latin corpus linguistics in general.
UD is not so very different from the Perseus format but conversion is not quite automatic. Francesco Mambrini has published his work so far on GitHub. This includes a UD version of the Perseus treebanks for the Iliad and Odyssey and we plan to add this to Beyond Translation as soon as possible.
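For readers who have not seen UD data, the sketch below reads one token in CoNLL-U, the ten-column, tab-separated format in which UD treebanks are distributed. The sample line is hypothetical and was written for this illustration; it is not copied from the published conversion, and the function name `parse_token` is likewise invented. The original Perseus tag is shown in the language-specific column as one possible way of carrying it along.

```python
# CoNLL-U stores one token per line in ten tab-separated columns.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hypothetical line for illustration; not taken from a published treebank.
LINE = ("1\tmênin\tmênis\tNOUN\tn-s---fa-\t"
        "Case=Acc|Gender=Fem|Number=Sing\t2\tobj\t_\t_")

def parse_token(line):
    """Split one CoNLL-U token line into a dict with parsed features."""
    token = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    feats = token["feats"]
    # An underscore marks an empty column in CoNLL-U.
    token["feats"] = ({} if feats == "_" else
                      dict(pair.split("=", 1) for pair in feats.split("|")))
    token["head"] = int(token["head"])
    return token

tok = parse_token(LINE)
print(tok["form"], tok["upos"], tok["feats"], tok["head"], tok["deprel"])
```

The information is recognizably the same as in the Perseus scheme (form, lemma, morphology, head and relation), which is why the two formats are close, but the labels and feature inventories differ, so a conversion has to do more than rename columns.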