Towards a Tool for the Automatic Extraction of Canonical References

Matteo Romanello (King’s College London)

Digital Classicist and Institute of Classical Studies Seminar 2010

Friday June 25th at 16:30, in room STB9, Senate House, Malet Street, London WC1E 7HU

Classicists usually refer to primary sources by means of abbreviated references known as canonical references. Linking primary sources (i.e. ancient texts) to secondary sources (i.e. modern publications about ancient texts) within a Digital Library (DL) is still an open issue. Although such references are already links in potentia, most of the time there are no hypertext links in either direction between the primary and secondary sources available on the Web. Some DLs already provide their users with links from secondary to primary sources. For instance, in the Perseus Digital Library the entries of the electronic version of the Greek-English Lexicon by H.G. Liddell, R. Scott and H.S. Jones [1] are linked to the corresponding source text and vice versa. However, this was done by manual encoding, which is highly time-consuming. As an attempt to address the automatic extraction of this kind of reference, this paper presents CREFEX, a machine learning-based Canonical REFerences Extractor.

Recent studies in Digital Classics have addressed different aspects of how to represent canonical references in a digital environment, thus highlighting their importance for scholars studying classical texts. Ruddy & Rebillard (2009) and Romanello (2007) focused, from different perspectives, on how to properly encode their semantics in HTML markup, whereas Smith (2009) proposed a network-based protocol to transform citations of primary sources into machine-actionable links. Moreover, Romanello (2008) proposed new value-added services for electronic publications, to be built upon a system linking primary and secondary sources.

The main goal of this paper is to lay a foundation for the task of extracting canonical references by devising a first tool for this purpose. In the NLP field, the performance of a new algorithm or tool for a given task (such as named entity recognition or part-of-speech tagging) is usually compared with that of existing ones in order to assess the improvement achieved. One of the difficulties in presenting such a tool is that, so far, no existing tool has been available for comparison.

The Canonical REFerences Extractor (CREFEX) is a first prototype of a machine learning-based tool to extract canonical references (Romanello et al. 2009). It uses a binary classifier of word-level n-grams to determine, given a specific set of training examples, whether a given sequence of tokens (i.e. a word-level n-gram) is a canonical reference or not. The references identified are then extracted from the text. As of now, CREFEX is based on a CRF-based classifier which is trained to recognize canonical references by looking at distinctive features. CRF (Conditional Random Fields) is a state-of-the-art statistical model for sequence classification (Lafferty et al. 2001) and has been used to implement tools for the extraction of bibliographic references (Councill et al. 2008). The main advantage of offering the tool as a web service is that it might be easily integrated into other existing projects.
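
The candidate-generation and feature-extraction steps described above might be sketched as follows. This is only an illustrative sketch in Python, not the CREFEX implementation: the whitespace tokenization, the function names, the four features and the sample sentence are all assumptions made for the example, and the training of the CRF classifier itself is left out.

```python
import re
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Split on whitespace; punctuation stays attached (e.g. 'Il.')."""
    return text.split()


def word_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous sequence of n tokens (word-level n-grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def token_features(token: str) -> Dict[str, bool]:
    """Hypothetical surface features; a real system would choose its own."""
    return {
        "ends_with_period": token.endswith("."),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        # Matches loci such as '1', '1.1' or '12.34.5', optionally
        # followed by a single punctuation mark.
        "is_numeric_locus": bool(re.fullmatch(r"\d+(\.\d+)*[.,;]?", token)),
    }


def ngram_features(ngram: Tuple[str, ...]) -> Dict[str, bool]:
    """Prefix each token's features with its position in the n-gram."""
    features: Dict[str, bool] = {}
    for position, token in enumerate(ngram):
        for name, value in token_features(token).items():
            features[f"{position}:{name}"] = value
    return features


# Illustrative sentence containing one reference, "Hom. Il. 1.1".
tokens = tokenize("as in Hom. Il. 1.1 the poet invokes the Muse")
candidates = word_ngrams(tokens, 3)

if __name__ == "__main__":
    for ngram in candidates:
        print(" ".join(ngram), ngram_features(ngram))
```

Each candidate n-gram, paired with a label saying whether it is a canonical reference, would then serve as a training example for the classifier. Abbreviation-like tokens followed by a numeric locus are the kind of distinctive pattern such features are meant to expose.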

Councill, I.G., Giles, C.L. & Kan, M.-Y., 2008. ParsCit: an Open-source CRF Reference String Parsing Package. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco: European Language Resources Association (ELRA).

Lafferty, J., McCallum, A. & Pereira, F., 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. 18th International Conf. on Machine Learning. San Francisco, CA: Morgan Kaufmann, pp. 282-289. [Accessed April 18, 2009].

Romanello, M., 2008. A semantic linking framework to provide critical value-added services for E-journals on classics. In S. Mornati & L. Chan, eds. ELPUB2008. Open Scholarship: Authority, Community, and Sustainability in the Age of Web 2.0 - Proceedings of the 12th International Conference on Electronic Publishing held in Toronto, Canada, 25-27 June 2008. pp. 401-414. [Accessed August 11, 2008].

Romanello, M., 2007. A Semantic Linking System for Canonical References to Electronic Corpora. In P. Zemánek, ed. International Conference on Electronic Corpora of Ancient Languages. Prague: Charles University, pp. 107-120.

Romanello, M., Boschetti, F. & Crane, G., 2009. Citations in the Digital Library of Classics: Extracting Canonical References by Using Conditional Random Fields. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries. Suntec City, Singapore: Association for Computational Linguistics, pp. 80-87.

Ruddy, D. & Rebillard, E., 2009. Text Linking in the Humanities: Citing Canonical Works Using OpenURL. [Accessed September 11, 2009].

Smith, N., 2009. Citation in Classical Studies. Digital Humanities Quarterly, 3(1). [Accessed March 15, 2009].

The seminar will be followed by wine and refreshments.
