Backoff Lemmatization for Ancient Greek with the Classical Language Toolkit
Patrick J. Burns (NYU)
Digital Classicist London seminar 2018
Friday July 27th at 16:30, in room 234, Senate House, Malet Street, London WC1E 7HU
Livecast on the Digital Classicist London YouTube channel.
Automated lemmatization, that is, the retrieval of dictionary headwords, is an active area of research in historical-language text analysis. In this paper, I describe the development of the Backoff Lemmatizer for Ancient Greek with the Classical Language Toolkit (CLTK), an open-source Python platform dedicated to developing natural language processing tools for historical languages (Johnson, 2017). Hellenists already have access to web-based applications such as Eulexis (Verkerk et al. 2017) and web services such as Morpheus (Almas, 2015), and recent work has used part-of-speech-assisted tagging (Bary, Berck, and Hendrickx 2017) to improve Ancient Greek lemmatization. The Backoff Lemmatizer seeks to improve on existing tools by combining training-data-based and rules-based tagging as a lemmatization strategy.
The Backoff Lemmatizer is in fact not a single lemmatizer but rather a customizable suite of sublemmatizers, based on the Natural Language Toolkit’s SequentialBackoffTagger. This multi-pass tagger allows the user to “chain taggers together so that if one tagger doesn’t know how to tag a word, it can pass the word on to the next” (Perkins, 2014, 92). While the backoff process was originally designed for part-of-speech tagging, it has also proven an effective strategy for lemmatization (approximately 90.34% accuracy as tested on Latin, compared with the 93.49% to 95.30% range reported in Eger et al., 2015). This paper focuses on the development of the Backoff Lemmatizer for Ancient Greek, but in the interest of addressing “standardization and customization in digital and collaborative classics research,” it will, by way of conclusion, discuss a current development strategy at the CLTK, namely the use of object-oriented architecture as an avenue to digital comparative philology. Because the Backoff Lemmatizer combines training-data-based and rules-based tagging, it is particularly well-suited to less-resourced languages (Piotrowski, 2012, 85). Through the use of module-level classes and language-specific inheritance (that is, through an object-oriented approach to philological work), the CLTK, in keeping with trends related to flexible philological infrastructure (Federico Boschetti and Angelo Marco del Grosso 2014/2015, 3.3) and multilingual design patterns (Crane et al. 2009), is building an integrated platform for comparative digital work on historical languages that can extend the functionality of existing tools to languages beyond Latin and Ancient Greek.
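The backoff chain and the inheritance-based design described above can be illustrated with a minimal, dependency-free Python sketch. It does not reproduce the CLTK or NLTK code: the class names (BackoffLemmatizer, DictLemmatizer, RegexLemmatizer, IdentityLemmatizer), the method names, and the toy Greek data are all assumptions made for illustration. The point is only the control flow: each sublemmatizer either returns a lemma or hands the token to the next one in the chain.

```python
import re
from typing import Dict, List, Optional, Tuple


class BackoffLemmatizer:
    """Hypothetical base class: try a token, else defer to the backoff."""

    def __init__(self, backoff: Optional["BackoffLemmatizer"] = None) -> None:
        self.backoff = backoff

    def choose_lemma(self, token: str) -> Optional[str]:
        """Return a lemma, or None if this sublemmatizer cannot decide."""
        raise NotImplementedError

    def lemmatize_token(self, token: str) -> Optional[str]:
        lemma = self.choose_lemma(token)
        if lemma is None and self.backoff is not None:
            # Pass the token down the chain.
            return self.backoff.lemmatize_token(token)
        return lemma

    def lemmatize(self, tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
        return [(token, self.lemmatize_token(token)) for token in tokens]


class DictLemmatizer(BackoffLemmatizer):
    """Data-based pass: look the token up in a form-to-lemma mapping."""

    def __init__(self, lemmas: Dict[str, str],
                 backoff: Optional[BackoffLemmatizer] = None) -> None:
        super().__init__(backoff)
        self.lemmas = lemmas

    def choose_lemma(self, token: str) -> Optional[str]:
        return self.lemmas.get(token)


class RegexLemmatizer(BackoffLemmatizer):
    """Rules-based pass: rewrite an ending with the first matching pattern."""

    def __init__(self, patterns: List[Tuple[str, str]],
                 backoff: Optional[BackoffLemmatizer] = None) -> None:
        super().__init__(backoff)
        self.patterns = [(re.compile(p), r) for p, r in patterns]

    def choose_lemma(self, token: str) -> Optional[str]:
        for pattern, replacement in self.patterns:
            if pattern.search(token):
                return pattern.sub(replacement, token)
        return None


class IdentityLemmatizer(BackoffLemmatizer):
    """Last resort: return the token itself as its lemma."""

    def choose_lemma(self, token: str) -> Optional[str]:
        return token


# Toy data, invented for this sketch (not drawn from any CLTK model).
toy_lemmas = {"λόγον": "λόγος", "ἦν": "εἰμί"}
toy_rules = [(r"ους$", "ος")]  # a naive accusative-plural rule that ignores accent shifts

# Build the chain from the last resort upward: dict -> regex -> identity.
identity = IdentityLemmatizer()
rules = RegexLemmatizer(toy_rules, backoff=identity)
lemmatizer = DictLemmatizer(toy_lemmas, backoff=rules)

result = lemmatizer.lemmatize(["λόγον", "νόμους", "καί"])
for token, lemma in result:
    print(token, lemma)
```

In this sketch the first token is resolved by the dictionary pass, the second falls through to the regex rule, and the third reaches the identity fallback. A language-specific lemmatizer would, under the same assumptions, subclass these classes and supply its own data and rules, which is the kind of inheritance the object-oriented approach relies on.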