Building Perseus 6: a new production version of the Perseus Digital Library

Published on Jan 19, 2024

Perseus has finalized plans for Perseus 6, a new public-facing version of its digital library. Support from the Humanities Collections and Reference Resources program at the National Endowment for the Humanities, from the Tufts Data Intensive Studies Center (DISC), from Tufts Technology Services, from the School of Arts and Sciences, and from Google is funding this transition. Our goal is to complete this transition by September 1, 2024. James Tauber of Signum University (formerly of Eldarion.com) is our lead partner in this work. We are also working with a partner to update the Perseus home page (whom we will name when Tufts finalizes the contract that we have approved).

The most important loose ends (as of January 2024) are the Perseus Art and Archaeology collections. These consist of CIDOC-CRM metadata (available for download on the home page for these collections), essays and descriptions, and a corpus with thousands of images, many of them taken by Maria Daniels in her years as Perseus Photographer. These collections have an infrastructure that is separate from, and much simpler than, that offered by the main Perseus digital library system (“the Hopper”). The biggest challenge for these collections is not technology but licensing. Maria Daniels took thousands of pictures of art objects in dozens of museums in the US and Europe, but she did so before the rise of the internet. Our agreements were based on earlier technology (Videodisc and CD ROM) and we were, in any event, conservative in what we requested. Planning for the future of these collections is ongoing.

Background: Perseus 4 to Perseus 6

In numbering versions of Perseus, readers will notice that we are jumping from Perseus 4 directly to Perseus 6. We did not formally designate a Perseus 5 because we have not yet created a system that would replace Perseus 4 but we informally view Perseus 5 as the combined capability of two earlier projects, the Scaife Viewer and the Beyond Translation Reading environment. Together, these two projects can replace Perseus 4 and we thus view them, collectively, as Perseus 5. We use Perseus 6, however, to designate the system that we are now building, which combines the features of the Scaife Viewer and Beyond Translation, allowing us not only to replace Perseus 4 but also to include new services and new categories of data that have become available over the subsequent 20 years.

As I write in January 2024, the current public-facing version of the Perseus Digital Library, Perseus 4 (“the Hopper”), is older than many undergraduates. David Mimno, now an Associate Professor in the department of Information Science at Cornell, built the first version of Perseus 4 in 2003 while working as head programmer at Perseus, before he had even begun graduate school. Others (such as Adrian Packel, Gabriel Weaver, Rashmi Singhal, and Bridget Almas) would further develop Perseus 4 over the next ten years. In 2013, we shifted to maintaining Perseus 4 in its then current state. The fact that we have been able to spend a decade moving towards replacing Perseus 4 reflects the quality of design and subsequent development that went into this system more than twenty years ago.

One major decision that we made in 2013 required a substantial investment in reorganizing our collections. We adopted the Canonical Text Services (CTS) data model (Konieczny 2021, Tiepmar 2019). CTS provides URNs (Uniform Resource Names):

  • Unique identifiers that pinpoint specific text passages.

  • Structured format: urn:cts:<namespace>:<work>:<passage>

  • Example: urn:cts:greekLit:tlg0012.tlg001.msA:1.1-1.5 (refers to lines 1-5 of Book 1 of the Iliad as they appear in a specific manuscript)
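The structure of the example URN can be made concrete with a short parsing sketch. This is a minimal illustration rather than a full CTS implementation: the class and function names are invented here, it assumes the work component has at most three dot-separated levels (text group, work, version), and it ignores features such as subreferences.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    """Components of a CTS URN (illustrative structure, not an official API)."""
    namespace: str
    textgroup: str
    work: Optional[str] = None
    version: Optional[str] = None
    passage_start: Optional[str] = None
    passage_end: Optional[str] = None


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its components (simplified: no exemplar level)."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")

    namespace = parts[2]
    work_fields = parts[3].split(".")
    if not 1 <= len(work_fields) <= 3 or not all(work_fields):
        raise ValueError(f"unexpected work component: {parts[3]!r}")
    # Pad to (textgroup, work, version) so that shorter URNs still parse.
    textgroup, work, version = work_fields + [None] * (3 - len(work_fields))

    start = end = None
    if len(parts) == 5 and parts[4]:
        # A passage is either a single citation ("1.1") or a range ("1.1-1.5").
        start, _, range_end = parts[4].partition("-")
        end = range_end or None

    return CtsUrn(namespace, textgroup, work, version, start, end)


urn = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1-1.5")
print(urn.textgroup, urn.work, urn.version, urn.passage_start, urn.passage_end)
# prints: tlg0012 tlg001 msA 1.1 1.5
```

Because the passage is the last colon-separated component, the same identifier can address a whole edition (no passage), a single line, or a range of lines.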

The adoption of CTS led to a revision of how we organized our TEI XML files. We had had great flexibility using TEI milestones and containers. We decided to regularize this work, using containers to represent the primary citation scheme for any given work. This conversion can be partly, but not completely, automated and there are many edge cases that require manual intervention.

In 2018, we produced the Scaife Viewer, a CTS-based browsing and searching environment with a newer code base and designed for scalability. The Scaife Viewer allowed Perseus to publish materials not only in Greek and Latin but also in languages such as Hebrew, Classical Persian and K’iche’ Maya. The Scaife Viewer enjoyed an unexpected success when Brill adopted it as the platform for all of its scholarly editions. Brill uses, and contributes to, the open source Scaife code base to publish its own largely proprietary content. The one open source collection currently available from Brill is the Literary History of Medicine. Funded by the Wellcome Trust, the LHM is available under a CC-BY-NC 4.0 License and (hopefully) demonstrates a different business model based on open data. Readers can see, among other things, facing Arabic text and English translation in the LHM (e.g., Ibn Abī Uṣaybiʿah, The Best Accounts of the Classes of Physicians).

If our only goal had been to replace Perseus 4, we could have continued extending the Scaife platform. We chose instead to complement Scaife with the Beyond Translation Project. Beyond Translation developed a more flexible backend (ATLAS) to address two challenges.

First, the Scaife Viewer is based upon CTS-compliant TEI XML as defined by the CapiTainS Guidelines (Clérice 2017). While we continue to transform data into this format, we wanted to be able to import textual data that was available in a much simpler format.

Figure 1: the opening of the Iliad with citation (in this case, book + ‘.’ + line) followed by plain text.

The example above contains an identifier followed by plain text. The format above does not support inline formatting (although that can be added as a layer of standoff markup). We do, of course, need to store CTS metadata about the text group (in this case, Homer: tlg0012), the work (the Iliad: tlg001) and the edition (the Monro-Allen Oxford Classical Text: perseus-grc2) but, once we have that basic data, we can import the text.
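As a rough sketch of how such a file might be read, the following assumes that each line holds a citation, then whitespace, then the text. The exact separator used by the real import format is an assumption here, and the function names are invented for illustration. The CTS identifiers (tlg0012, tlg001, perseus-grc2) are the ones named above.

```python
from typing import Iterator, NamedTuple


class CitedLine(NamedTuple):
    ref: str   # citation, e.g. "1.1" for book 1, line 1
    text: str  # plain text of the line, with no inline markup


def parse_cited_text(raw: str) -> Iterator[CitedLine]:
    """Yield (ref, text) pairs from 'citation + whitespace + text' lines."""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(None, 1)  # split once, on the first run of whitespace
        text = fields[1] if len(fields) > 1 else ""
        yield CitedLine(fields[0], text)


def passage_urn(version_urn: str, ref: str) -> str:
    """Attach a citation to the CTS URN of an edition."""
    return f"{version_urn}:{ref}"


# CTS metadata for the edition is stored separately from the text itself.
ILIAD_OCT = "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2"

SAMPLE = """\
1.1 μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
1.2 οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε,
"""

for cited in parse_cited_text(SAMPLE):
    print(passage_urn(ILIAD_OCT, cited.ref), cited.text)
```

Keeping the text this plain is what lets annotations live in separate layers that point back to a citation, and later to individual tokens.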

Second, we needed to support a wider range of annotations and services than were available when Perseus 4 was developed in 2004. The plain text format above also allowed us to manage a growing number of annotations and annotation classes. We needed to support classes of annotation (such as treebanks and word-level translation alignments) that were not available in 2003 when work on Perseus 4 began.

With primary support from the National Endowment for the Humanities Office of Digital Humanities, we were able to build the Beyond Translation reading environment (2019-2023). Beyond Translation allowed us to integrate an open ended set of annotation classes, with a particular focus on treebanks and word-level translation alignments. Planning for Greek and Latin Treebanks actually began in 2002 (Crane 2002) before Perseus 4 but development did not begin until 2007. David Bamman, then head programmer at Perseus, and Harry Diakoff of the Alpheios Project introduced a second crucial annotation class: links between words and phrases in translation to the corresponding words and phrases in a source text. Beyond Translation supports not only treebanks and translation alignments but metrical analyses, automatically generated maps, links between CTS text and details of page images (via the International Image Interoperability Framework: IIIF) and more general annotations to words and phrases in a text.

Road Map for Perseus 6

Creating Perseus 6 involves the following tasks, each of which has now been funded.

Replacing the Perseus home page (and associated subpages) and the Perseus Blog: We have begun to use the Knowledge Futures PubPub publishing platform to publish information about Perseus (such as this document). This will involve moving older publications from Google Docs and the Tufts WordPress site. We are also going to add a Perseus Data Journal in which we can provide more detailed information about the documents and services that we can provide.

Integration of the Beyond Translation features into the Scaife Viewer: Users will see that the front-end layout of the Scaife Viewer and the Beyond Translation Reading Environment are very similar. The backends were designed to be sufficiently modular so that we can merge their functionality. In theory, the micro-services architecture makes this a relatively straightforward task but almost nothing is ever as simple as planned.

Lemmatization and Part of Speech Pipeline: (1) We want to be able to search for particular dictionary words (e.g., retrieve all instances of Latin facio, “to make”), particular parts of speech (e.g., retrieve all subjunctives in a given section) and both at once (e.g., retrieve subjunctive forms of facio). (2) Readers looking at a particular word in a particular text should be able to view its dictionary form and part of speech.
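The three kinds of query described here can be sketched over a list of annotated tokens. The token structure, the field names and the sample data (including the citations) are invented for illustration. A production system would presumably run such queries against an index rather than a Python list.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Token:
    ref: str                    # citation of the passage containing the token
    form: str                   # surface form as it appears in the text
    lemma: str                  # dictionary form
    pos: str                    # part of speech
    mood: Optional[str] = None  # only meaningful for finite verbs


def search(tokens: Iterable[Token],
           lemma: Optional[str] = None,
           pos: Optional[str] = None,
           mood: Optional[str] = None) -> List[Token]:
    """Return tokens matching every criterion that was supplied."""
    return [
        t for t in tokens
        if (lemma is None or t.lemma == lemma)
        and (pos is None or t.pos == pos)
        and (mood is None or t.mood == mood)
    ]


# Invented sample data; the citations do not refer to a real passage.
TOKENS = [
    Token("1.1", "faciat", "facio", "verb", "subjunctive"),
    Token("1.1", "fecit", "facio", "verb", "indicative"),
    Token("1.2", "sit", "sum", "verb", "subjunctive"),
    Token("1.2", "insula", "insula", "noun"),
]

print([t.form for t in search(TOKENS, lemma="facio")])
print([t.form for t in search(TOKENS, mood="subjunctive")])
print([t.form for t in search(TOKENS, lemma="facio", mood="subjunctive")])
```

The second goal, showing a reader the dictionary form and part of speech of one word, is the same lookup run in the other direction: from a token in a passage to its `lemma` and `pos`.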

There are now a number of different repositories that have published lemmatization and part of speech tagging for Greek and Latin corpora. The first published treebanks were manually produced (e.g., the Perseus Treebanks, Gorman Trees, Pedalion, Proiel). Researchers immediately began using the first available treebank data as training data for machine learning (Bamman 2011; Majidi 2013). The error rates were high, but automatically generated annotations and the pipelines that produce them have continued to appear (Celano 2017, Vatri and McGillivray 2018, Keersmaekers 2022, Burns 2023).

Our goals are:

  • To provide rational access to multiple different datasets that can be keyed to particular words in particular passages (e.g., Latin insula, “island,” as nominative in one passage and ablative in another). This includes being able to prioritize curated before automatic annotations.

  • To support a sustainable default annotation system that can provide lemmatization and part of speech data so that we always have at least one result. A major goal here is to be able to upgrade this default system over time as better models or systems emerge.
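One way to picture these two goals together is a small resolver that prefers curated annotations, falls back to automatic ones, and finally calls a replaceable default tagger so that there is always a result. This is only a sketch: the data structures, the source names and the placeholder tagger are hypothetical and do not describe the actual Perseus 6 design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Annotation:
    lemma: str
    pos: str
    source: str    # which dataset produced the annotation
    curated: bool  # True for manually produced or reviewed data


# A token is keyed by its passage citation and its position in that passage.
TokenKey = Tuple[str, int]


def placeholder_tagger(form: str) -> Annotation:
    """Stand-in for a default tagger; a real model would replace this."""
    return Annotation(lemma=form.lower(), pos="unknown", source="default", curated=False)


def resolve(key: TokenKey,
            form: str,
            store: Dict[TokenKey, List[Annotation]],
            default_tagger: Callable[[str], Annotation] = placeholder_tagger) -> Annotation:
    """Pick one annotation: curated first, then automatic, then the default."""
    candidates = store.get(key, [])
    curated = [a for a in candidates if a.curated]
    if curated:
        return curated[0]
    if candidates:
        return candidates[0]
    return default_tagger(form)


# Invented data: two datasets disagree about the case of the same token.
STORE: Dict[TokenKey, List[Annotation]] = {
    ("1.1", 3): [
        Annotation("insula", "noun-ablative", "automatic-tagger", False),
        Annotation("insula", "noun-nominative", "curated-treebank", True),
    ],
    ("1.2", 5): [
        Annotation("insula", "noun-ablative", "automatic-tagger", False),
    ],
}

print(resolve(("1.1", 3), "insula", STORE).source)   # curated-treebank
print(resolve(("1.2", 5), "insula", STORE).source)   # automatic-tagger
print(resolve(("9.9", 0), "Insulam", STORE).source)  # default
```

Passing the default tagger in as a parameter reflects the upgrade goal: a better model can be swapped in without touching the stored datasets or the resolution logic.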

Additional Treebank Integration: Beyond Translation supports multiple treebanks for a given text (e.g., the Homer Treebank in the Perseus annotation scheme and the automatically generated version from Glaux Trees). Even when they use the same annotation schemes, Greek and Latin Treebanks tend to differ slightly in how they publish their data. The main work here is to wrangle data from different sources into a compatible version that we can then process.

Automated content ingestion: Perseus 6 should automatically update itself whenever one of its trusted information sources (typically a GitHub repository) is updated. As we include content that has not passed through the strict CapiTainS validation procedures (such as treebanks) the situation becomes more complicated but (for the moment) tractable, so long as third parties keep their publication formats reasonably stable.

Integration of Perseus Catalog Data: First published in 2013, the Perseus Catalog (Mimno 2005; Babeu 2019) makes available metadata that uses the Functional Requirements for Bibliographic Records (FRBR) to represent text groups, works and editions of Greek and Latin sources. First, like other catalogs of Greek and Latin sources, the Perseus Catalog covers works that are lost except insofar as surviving sources mention, paraphrase or quote them (which philologists describe as fragmentary authors and works). Second, unlike many catalogs, the Perseus Catalog was designed not just to provide a checklist of editions chosen by a particular reference work but to be able to track multiple editions of a given work. Ultimately, we would be able to manage comprehensive catalogs for every witness for any work: not only critical editions but papyri, manuscripts and other textual sources. Third, the Perseus Catalog can point to scanned versions of sources and thus allow readers to consult page images where the text has not yet been transcribed or where readers wish to compare transcriptions against images of the source.

Generalized Table of Contents Functionality: While we can arbitrarily serve chunks of a source text (e.g., one book, chapter or section at a time in a work split into books, chapters and sections, or clumps of 10, 20 or 30 lines for poetry), editors have chosen their own ways to divide texts into chunks.

Figure 2: Chunking for the opening of the Odyssey in Perseus 4.

Figure 3: initial table of contents widget for Perseus 6, mapping book+line numbers to folio pages in the Venetus A manuscript of the Iliad.

Many of the poetic texts in Perseus have breaks encoded in their XML. These breaks are called “cards” because they were designed to divide a text into separate containers. The name is based on HyperCard, a pioneering, but now defunct, system that Apple published in 1987 and that provided the publication medium for both Perseus 1 and 2. We want, ultimately, to be able to use multiple schemes to chunk a single text. For now, we want to be able to add at least one curated chunking scheme for any text.
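The difference between arbitrary and curated chunking can be sketched in a few lines. Both functions operate on an ordered list of citations. The function names and the card boundaries in the example are invented and do not reproduce the breaks in any Perseus text.

```python
from typing import Iterable, List, Sequence


def chunk_fixed(refs: Sequence[str], size: int) -> List[List[str]]:
    """Arbitrary chunking: consecutive groups of `size` citations."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(refs[i:i + size]) for i in range(0, len(refs), size)]


def chunk_by_breaks(refs: Sequence[str], starts: Iterable[str]) -> List[List[str]]:
    """Curated chunking: start a new chunk at each editor-chosen citation."""
    start_set = set(starts)
    chunks: List[List[str]] = []
    for ref in refs:
        if ref in start_set or not chunks:
            chunks.append([])  # open a new chunk at a break (or at the very start)
        chunks[-1].append(ref)
    return chunks


# Twenty-five lines of a hypothetical poem, cited as book.line.
refs = [f"1.{n}" for n in range(1, 26)]

print([len(c) for c in chunk_fixed(refs, 10)])
# prints: [10, 10, 5]
print([len(c) for c in chunk_by_breaks(refs, ["1.1", "1.8", "1.22"])])
# prints: [7, 14, 4]
```

Since each scheme is just a list of starting citations, several schemes can coexist for one text without changing the text itself.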

Ingestion and display of reference works: Perseus has offered integrated commentaries since Perseus 3.

Figure 4: Marchant’s commentary on the opening of Thucydides’ History of the Peloponnesian War as it appears in Perseus 4. The commentary is aligned to book 1, chapter 1, section 1, but there are no links into the Greek text.

Charles Pletcher has largely completed work updating the XML of legacy Perseus commentaries so that they can be ingested into Perseus 6. The markup revision involves not only a change in format but also the beginning of a longer term effort to improve the functionality.

Figure 5: Marchant on the opening of Thucydides in Beyond Translation. Now we can link the commented text in annotations to the relevant spans in the source text but the links in the commentary need to match and perfecting these links will take another layer of review. For now, though, our goal is to link what we can with the data as it is and refine the links (in as automated a fashion as possible) at a later time.

Word Study Tool:

Figure 6: Perseus Word Study tool with basic morphological information, short definition, links to dictionary entries and statistics for the current work.

Figure 7: Fuller statistics about the word frequency throughout the Perseus corpus.

The venerable Perseus Word Study tool is designed to provide basic information about any word in the corpus. The tool in Perseus 4 was based on the Morpheus morphological analyzer, which simply enumerates each possible morphological analysis and dictionary form for any given word, regardless of its immediate context. We did implement a voting system that collected data, but the system became impractical when Perseus moved from a single server and began to run on multiple virtual machines. We did not have the resources to aggregate votes from different machines, but we still show the legacy data.

We now have much different — and better — data, with much less ambiguity about morphological analysis and dictionary form. We also have access to syntactic data (e.g., how often a verb takes a dative object, what adjectives go with what nouns etc.). The technical work to display a new Word Study tool does not need to be elaborate. The first challenge is to decide what data to show and how to show it.

Accessibility: We need to update the new system with basic levels of standard accessibility. The Perseus UI, while intuitive, is complex. Work to ensure compliance with accessibility standards is critical. Such updates do not by themselves constitute research but they provide a starting point for new research questions. Our core research focuses on making complex sources intellectually more accessible and developing a wide range of services to augment intelligence. 

URL redirection: We need to redirect existing URLs, updating them to align with the new, better-structured data models upon which we are now building. This is a complex process because, even as we move on from Perseus 4, we still have some links that reflect Perseus 3 (which was retired in 2005). However, developing an ability to redirect these millions of links also positions us to analyze patterns of usage.
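A redirect of this kind amounts to parsing a legacy link, looking up the document it names, and emitting an address built on a CTS URN. The sketch below makes several assumptions that should be checked against real data: that legacy links carry a `doc` query parameter of the form `Perseus:text:<id>:book=…:card=…`, that the lookup table entry is correct, and that a `card` value can be treated as a starting line. The target base URL is a placeholder, not a real Perseus 6 address.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical lookup table from a legacy document id to a CTS edition URN.
DOC_TO_URN: Dict[str, str] = {
    "1999.01.0133": "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2",
}

# Placeholder base address for the new reading environment.
NEW_BASE = "https://example.org/reader/"


def redirect_target(old_url: str) -> Optional[str]:
    """Return the new address for a legacy link, or None if it is unknown."""
    doc_values = parse_qs(urlparse(old_url).query).get("doc")
    if not doc_values:
        return None

    fields = doc_values[0].split(":")
    # Assumed shape: Perseus:text:<document id>[:key=value ...]
    if len(fields) < 3 or fields[:2] != ["Perseus", "text"]:
        return None

    urn = DOC_TO_URN.get(fields[2])
    if urn is None:
        return None

    # Collect locators such as book=1 and card=1 into a dotted citation.
    locators = dict(f.split("=", 1) for f in fields[3:] if "=" in f)
    passage = ".".join(locators[k] for k in ("book", "card") if k in locators)
    return f"{NEW_BASE}{urn}:{passage}/" if passage else f"{NEW_BASE}{urn}/"


old = "http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0133:book=1:card=1"
print(redirect_target(old))
```

Returning `None` for anything unrecognized leaves room for a fallback page, and logging those misses would be one simple starting point for the usage analysis mentioned above.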

Analytics: Once we can see what data and services our users consult and in what sequences, we have an opportunity to do fundamental work in reception studies (how the world views literature) and in new forms of digitally mediated reading. We can begin with some simple features such as most commonly viewed authors/works/passages and on patterns of use (e.g., the extent to which those focusing on English translation read different sources than those focusing on the Greek and Latin originals).

Multilingual content migration: The current Perseus does have some support for Arabic as well as Greek, Latin and modern European languages, but we need to broaden our linguistic coverage and enhance our ability to integrate Arabic and Persian – we have collaborators in the US and beyond who want to use our reading environment for these languages.

Software review and bug fixes: We have more general software work that will strengthen our services and enhance our ability to do research and attract funding.

The new version of Perseus is already largely container based, but some work is required to conform all component configuration to a container orchestration solution (Kubernetes) supported by Tufts Technology Services (TTS).

References

Babeu, Alison. “The Perseus Catalog: Of FRBR, Finding Aids, Linked Data, and Open Greek and Latin.” Digital Classical Philology, De Gruyter Saur, 2019, pp. 53–72, https://www.degruyter.com/document/doi/10.1515/9783110599572-005/html.

Bamman, David, and Gregory Crane. “Measuring Historical Word Sense Variation.” Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries, 2011, pp. 1–10, https://people.ischool.berkeley.edu/~dbamman/pubs/pdf/jcdl2011.pdf.

Burns, Patrick J. LatinCy: Synthetic Trained Pipelines for Latin NLP. arXiv:2305.04365, arXiv, 7 May 2023, https://doi.org/10.48550/arXiv.2305.04365.

Celano, Giuseppe G. A. POS-Tagged Ancient Greek Texts (v1.0.0). v1.0.0, Zenodo, 21 Mar. 2017, https://doi.org/10.5281/zenodo.437103.

Clérice, Thibault. “Les outils CapiTainS, l’édition numérique et l’exploitation des textes.” Médiévales. Langues, Textes, Histoire, vol. 73, no. 73, Dec. 2017, pp. 115–31, https://doi.org/10.4000/medievales.8211.

Crane, Gregory. “Don’t Miss the Lexicographers for the Treebanks Philology in an Electronic Age.” Conference on the Cambridge Greek Lexicon, July 2002, https://www.academia.edu/82054421/Dont_miss_the_lexicographers_for_the_treebanks_Philology_in_an_electronic_age.

Gorman, Vanessa B. “Dependency Treebanks of Ancient Greek Prose.” Journal of Open Humanities Data, vol. 6, no. 1, Mar. 2020, p. 1, https://doi.org/10.5334/johd.13.

Keersmaekers, Alek, and Toon Van Hal. “Creating a Large-Scale Diachronic Corpus Resource: Automated Parsing in the Greek Papyri (and beyond).” Natural Language Engineering, 2022, https://doi.org/10.1017/S1351324923000384.

Konieczny, Michael. “What Is a CTS URN?” Perseus Digital Library Updates, 2021, https://sites.tufts.edu/perseusupdates/2021/01/05/what-is-a-cts-urn/. Accessed 14 Jan. 2024.

Majidi, Saeed, and Gregory Crane. “Active Learning for Dependency Parsing by A Committee of Parsers.” Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013), Association for Computational Linguistics, 2013, pp. 98–105, https://aclanthology.org/W13-5711.

Mimno, David, et al. “Hierarchical Catalog Records: Implementing a FRBR Catalog.” D-Lib Magazine, vol. 11, no. 10, 2005, https://doi.org/10.1045/october2005-crane.

Tiepmar, Jochen, and Gerhard Heyer. “The Canonical Text Services in Classics and Beyond.” Digital Classical Philology, De Gruyter Saur, 2019, pp. 95–114, https://www.degruyter.com/document/doi/10.1515/9783110599572-007/html?lang=en.

Vatri, Alessandro, and Barbara McGillivray. “The Diorisis Ancient Greek Corpus.” Research Data Journal for the Humanities and Social Sciences, vol. 3, no. 1, 2018, pp. 55–65, https://www.turing.ac.uk/news/publications/diorisis-ancient-greek-corpus.
