Papyrology and Linguistic Annotation: How can we make TEI EpiDoc XML corpus and Treebanking work together?
Marja Vierros (Helsinki)
Digital Classicist London & Institute of Classical Studies seminar 2014
Friday July 25th at 16:30, in Room G35, Senate House, Malet Street, London WC1E 7HU
Greek documentary papyri provide a rich source for linguists who wish to study Ancient Greek as it was written in everyday texts, preserved directly from antiquity. Papyrologists are lucky in that their material is compiled into a comprehensive digital corpus (the Duke Databank of Documentary Papyri), and that there is a scholarly web resource which enables browsing and searching this corpus together with many other digital papyrological resources (the Papyrological Navigator). However, for a linguist studying phonological or morpho-syntactic variation, for example, this resource does not provide much help in finding structures or patterns of linguistic usage. A linguistically annotated digital corpus is therefore a desideratum.
Some Ancient Greek literature and the Greek New Testament have been annotated linguistically. The Ancient Greek Dependency Treebank under the Perseus Digital Library includes Homer, Hesiod, Greek drama and some Plato. The PROIEL parallel corpus includes the New Testament in Greek and other old Indo-European languages, and has recently added other literature as well, e.g. Herodotus in Greek. Both projects use a treebank method: Perseus has developed the annotation environment Alpheios, which is based on the Prague Dependency Treebank model, and PROIEL has adapted the Perseus annotation system to its own needs.
Since these annotation environments are already available, I wanted to test how well they would serve the papyrological material and my interest in studying variation and contact-induced language phenomena. The papyrological texts in the corpus are encoded in TEI EpiDoc XML, and the Alpheios treebanking environment works in XML, too. EpiDoc mark-up deals with the fragmentary nature of the papyri (gaps, supplied letters, abbreviations etc.) and tags these features letter by letter when needed. In Alpheios, however, the word is the essential unit. Thus, for my test corpus of 50 papyrus texts, we needed an XSLT stylesheet to strip those EpiDoc tags that broke up words before the texts could be worked with in Alpheios (which automatically provides an ID for every word). Unfortunately, that meant the loss of important papyrological information on what in fact survives in the texts and what has been added by the editors. Moreover, Alpheios offers no way to mark up several features that I would like to annotate, e.g. phonological and other variants (mostly found in EpiDoc as <reg> vs. <orig> tags and <corr> vs. <sic> tags). In this seminar, I wish to discuss certain solutions I have in mind for combining the Alpheios phase (which provides a good morphological and syntactic tool) with an additional mark-up phase for linguistic variation, and how to make the resulting linguistically annotated XML corpus communicate with the Papyrological Navigator, so that the papyrological information remains near at hand for the linguist, too.
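As a rough illustration of the word-level flattening described above, the following is a minimal Python sketch. It is not the XSLT stylesheet used for the test corpus. The sample fragment and the names `flatten`, `tokenize` and `SAMPLE` are hypothetical, invented for this example, and the handling of `<choice>` is a simplifying assumption rather than a full treatment of EpiDoc.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical EpiDoc-style fragment (not taken from the corpus):
# <supplied> marks an editorial restoration inside a word, and <choice>
# pairs a regularised form (<reg>) with the original spelling (<orig>).
SAMPLE = (
    f'<ab xmlns="{TEI_NS}">'
    'ὁμολο<supplied reason="lost">γῶ</supplied> '
    '<choice><reg>ἔχειν</reg><orig>ἔχιν</orig></choice> '
    'παρὰ σοῦ'
    '</ab>'
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split("}")[-1]


def flatten(elem, drop=("orig", "sic")):
    """Return the text of elem with all tags removed.

    Inside a <choice>, children whose name is in `drop` are skipped, so
    only one of the two alternative readings ends up in the output.
    """
    in_choice = local_name(elem.tag) == "choice"
    parts = [elem.text or ""]
    for child in elem:
        if not (in_choice and local_name(child.tag) in drop):
            parts.append(flatten(child, drop))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


def tokenize(xml_string, drop=("orig", "sic")):
    """Flatten the fragment, split on whitespace, and number the words."""
    root = ET.fromstring(xml_string)
    words = flatten(root, drop).split()
    return list(enumerate(words, start=1))


if __name__ == "__main__":
    for word_id, word in tokenize(SAMPLE):
        print(word_id, word)
```

Calling `tokenize(SAMPLE, drop=("reg", "corr"))` would keep the original spellings instead of the regularised ones. In either case the output no longer records which letters were supplied by the editor, which is exactly the loss of papyrological information noted above.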
The seminar will be followed by wine and refreshments.