Textual Re-use of Ancient Greek Texts: A case study on Plato’s works

Marco Büchler & Annette Loos (eAQUA Project, Leipzig)

Digital Classicist/ICS Work in Progress Seminar, Summer 2009

Friday 26th June at 16:30, in room STB3/6, Senate House, Malet Street, London WC1E 7HU

"Users of this or any edition are warned that the textual variants presented by citations from Plato in later literature have not yet been as fully investigated as is desirable." This shortcoming, described by Kenneth Dover (Plato Symposium, Cambridge, 1980, VII), still exists and is unlikely to be corrected quickly with traditional research techniques. Textual reuse plays an important role in research in Classical Studies. As in modern publications, ancient authors use the texts of others as sources for their own work; in ancient texts, however, a stronger tendency towards word-for-word citation can be observed. Within the eAQUA project we investigate the reception of Plato as a case study of textual reuse in ancient Greek texts. Our research in eAQUA is carried out in three steps. First, we extract word-for-word citations by combining n-gram overlaps with significant terms for several of Plato’s works. In the second step, the constraints on syntactic word order are relaxed by combining text mining and information retrieval techniques. A positional inverted list is used to select only those reuse candidates that share a small set of matching, non-common words within a citation. This selection is necessary because a complete pairwise comparison of all roughly 5.5 million sentences in the TLG corpus would take approximately 1000 years, owing to its quadratic complexity of O(n²). An intelligent pre-clustering of relevant reuse candidates is therefore needed, and such a divide-and-conquer strategy dramatically reduces the complexity. Whilst the second step only increases the degree of free word order, in the third step the algorithm is extended to cover words that are used similarly, such as go and walk. These candidate words are identified by their similar co-occurrence profiles. The three levels briefly described above are only one dimension of reuse exploration.
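The first two steps can be illustrated with a minimal Python sketch. It is not the eAQUA implementation: the function names, the thresholds (max_sentences, min_shared), the use of word trigrams and the English toy sentences are all assumptions made for illustration. Step 1 counts shared word n-grams; step 2 uses a positional inverted list so that only sentences sharing a few non-common words become reuse candidates, instead of comparing every sentence with every other one.

```python
from collections import defaultdict
from itertools import combinations


def word_ngrams(tokens, n=3):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(tokens_a, tokens_b, n=3):
    """Step 1: count the word n-grams shared by two sentences."""
    return len(word_ngrams(tokens_a, n) & word_ngrams(tokens_b, n))


def build_positional_index(sentences):
    """Map each word to its postings: (sentence id, position in sentence)."""
    index = defaultdict(list)
    for sid, tokens in enumerate(sentences):
        for pos, token in enumerate(tokens):
            index[token].append((sid, pos))
    return index


def candidate_pairs(index, max_sentences=2, min_shared=2):
    """Step 2: pair up sentences that share at least `min_shared` words,
    counting only words that occur in at most `max_sentences` sentences
    (an assumed, simplistic definition of a "non-common" word)."""
    shared = defaultdict(set)
    for word, postings in index.items():
        sids = sorted({sid for sid, _ in postings})
        if len(sids) < 2 or len(sids) > max_sentences:
            continue  # unique to one sentence, or too common to be useful
        for a, b in combinations(sids, 2):
            shared[(a, b)].add(word)
    return {pair: words for pair, words in shared.items()
            if len(words) >= min_shared}


# Toy data (hypothetical; a real run would use tokenised Greek sentences).
sentences = [
    "the unexamined life is not worth living".split(),
    "he says the unexamined life is not worth living for men".split(),
    "the life of the city is busy".split(),
]
index = build_positional_index(sentences)
candidates = candidate_pairs(index)
```

Here only sentences 0 and 1 become a candidate pair, because common words such as "the" and "life" are ignored. The stored positions are not used in this sketch; they would allow a later check on how far apart the matching words are. For scale, n(n-1)/2 with n = 5.5 million sentences gives about 1.5 × 10^13 sentence pairs, which is why candidate selection has to avoid the full pairwise comparison.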

Other relevant dimensions that will be discussed are the degree of preprocessing and the visualisation of textual reuse in terms of citations. In preprocessing, the main focus is on tokenisation (ancient texts need more active tokenisation than texts in modern languages), normalisation (reducing all words internally to a lower-case representation without diacritics) and lemmatisation (reducing all words internally to their base form). This dimension can speed up the algorithm and also improve the results for strongly inflected languages like Ancient Greek. The visualisation dimension of textual reuse is important because text mining approaches typically produce a huge amount of data that cannot be explored manually. In the WIP seminar we will discuss the technical realisation and efficiency of the dimensions mentioned above and apply them to the reception of Plato. Drawing on substantial experience from an ongoing collaboration between researchers in Classical Studies and Computer Science, we shall also reflect on the different approaches to working with text.
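The three preprocessing stages can be sketched as follows, again as an illustration under stated assumptions rather than the project's own pipeline. Normalisation uses Python's standard unicodedata module to lower-case the text and strip combining diacritics; the regular-expression tokeniser is deliberately naive, and TOY_LEMMAS is a hypothetical two-entry lookup table standing in for a real lemmatiser.

```python
import re
import unicodedata

# Hypothetical toy lemma table; a real system would use a full
# morphological lexicon for Ancient Greek.
TOY_LEMMAS = {
    "λογου": "λογος",
    "λογον": "λογος",
}


def normalise(text):
    """Lower-case the text and remove diacritics (accents, breathings)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # After decomposition, diacritics are separate combining marks
    # (Unicode category "Mn"), so they can simply be filtered out.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def tokenise(text):
    """Naive tokenisation: split on anything that is not a word character."""
    return re.findall(r"\w+", text)


def lemmatise(token, lemmas=TOY_LEMMAS):
    """Return the base form if it is in the table, else the token itself."""
    return lemmas.get(token, token)


def preprocess(text):
    """Normalise, tokenise and lemmatise a piece of text."""
    return [lemmatise(token) for token in tokenise(normalise(text))]
```

With this sketch, preprocess("Λόγον, λόγου!") maps both inflected forms to the same internal representation, so that later matching steps compare one base form instead of several surface forms.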


Audio recording of seminar (MP3)

Presentation (PDF)