Historical Text Re-use Detection on Perseus Digital Library

Marco Buchler and Gregory Crane (Leipzig)

Institute of Classical Studies Digital Seminar 2012

Friday June 29th at 16:30, in Room G37, Senate House, Malet Street, London WC1E 7HU

In this paper we introduce the research collaboration between Perseus (Humanities) and eTRACES (Computer Science) in the field of text re-use detection, which covers uncovering quotations, paraphrases, allusions, and even analogies and translations. Since quoting never happens by chance but serves a positive purpose (agreeing with an earlier author’s text) or a negative one (disagreeing with it), we propose to use text re-use techniques to generate text re-use graphs quantitatively and to identify from them the hotly quoted passages of a work, which can then be used to score information retrieval results.

Recent research has focused on two themes. First, how can humanists cope in general with the vast body of data now available to them? Second, and more specifically, faceted search techniques are used to restrict the search result and so gradually select “relevant” text passages. But this has led to two significant problems. On the one hand, available metadata such as dating information or authorship attribution, even when derived from library catalogues, is imperfect and makes good selections almost impossible. On the other hand, the use of faceted search techniques has largely removed the serendipity effect of finding unexpected associations. Furthermore, techniques such as relevance feedback and user profiling leave out historical languages such as Latin and Greek, because there are no native speakers of these languages anymore.

Following a brief introduction of the Greek data and the TRACER tool, the focus is on the text re-use graph/network, which is created entirely by quantitative approaches. In detail, we focus on folding the graph to generate so-called temperature maps (cf. figure 1) that highlight which text passages have been hotly quoted and which have not. Finally, we describe how these maps are used for ranking search results in an information retrieval system in view of the ‘cultural heritage’ in historical texts.
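
As a rough illustration of the idea, the following Python sketch folds a small re-use graph into a temperature map and uses it to order search hits. It is not the TRACER implementation: the passage identifiers, the edge list, and the function names `temperature_map` and `rank_hits` are all hypothetical, and "folding" is assumed here to mean simply counting, for each passage, how many re-use edges point to it.

```python
from collections import Counter

# Hypothetical re-use edges as (quoting_passage, quoted_passage) pairs.
# The identifiers are invented for illustration only.
EDGES = [
    ("athenaeus:1.3", "homer.iliad:1.1"),
    ("plutarch:2.7", "homer.iliad:1.1"),
    ("strabo:8.3", "homer.iliad:2.5"),
    ("plutarch:4.1", "plato.republic:7.514"),
    ("athenaeus:5.2", "homer.iliad:1.1"),
]


def temperature_map(edges):
    """Fold the graph: count how often each passage is quoted (assumed definition)."""
    return Counter(quoted for _quoting, quoted in edges)


def rank_hits(hits, temperatures):
    """Order search hits hottest first; unquoted passages count as 0.

    sorted() is stable, so hits with equal temperature keep their input order.
    """
    return sorted(hits, key=lambda passage: temperatures.get(passage, 0), reverse=True)


temps = temperature_map(EDGES)
hits = ["homer.iliad:2.5", "homer.iliad:3.9", "homer.iliad:1.1"]
ranked = rank_hits(hits, temps)
print(ranked)
```

In this sketch a passage that is never quoted still appears in the ranking, only last; a real system would presumably combine such a score with conventional relevance measures rather than rely on it alone.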


Audio recording of seminar (MP3)

Presentation (PDF)