Bringing Modern Spell Checking Approaches to Ancient Texts: Automatized Suggestions for Incomplete Words

Marco Büchler (Leipzig)

Institute of Classical Studies Digital Seminar 2011

Friday July 29th at 16:30, in Room 37, Senate House, Malet Street, London WC1E 7HU

One of the most challenging tasks for scholars working with ancient data is the completion of texts that have only been partially preserved. At present, reconstructing a text manually requires a great deal of scholarly experience and the use of dictionaries such as Liddell Scott Jones or Lewis & Short. Even though text search tools such as Diogenes or papyri.info exist, scholars still have to work through the results manually, and they need a very good knowledge of the text, its cultural background and its documentary form in order to decide on the correct reconstitution of the damaged text. As a result, a “selective and relatively small scope”, especially in the case of younger scholars, restricts the set of potential candidates.

In this presentation, an unsupervised approach from the field of machine learning is introduced for recommending words based on several classes of spell checking (Kukich 1992, Schierle et al. 2008) and text mining algorithms. Both spell checking and text completion can be separated into two main tasks. The first step is to identify an incorrect or incomplete word. Although this can be very difficult when working with modern texts (as in the spell checking support provided by modern word processing suites), the existing sigla of the Leiden Conventions (Bodard et al. 2009) can be used when dealing with ancient texts. The second step is then to generate likely suggestions using methods such as:

  • Semantic approaches: Sentence co-occurrences (Buechler 2008) and document co-occurrences (Heyer et al. 2008) are used to identify candidates based on different contextual windows (Bordag 2008). The basic idea behind this type of classification is motivated by Firth's famous statement about a word's meaning: “You shall know a word by the company it keeps.” (Firth 1957).
  • Syntactical approaches: Word bi- and trigrams (Heyer et al. 2008): With this method, the immediate neighbourhood of a word is observed and likely candidates are identified based on a selected reference corpus.
  • Morphological dependencies: As in the Latin and Ancient Greek Dependency Treebanks of Perseus (Crane et al. 2009), morphological dependencies are used to suggest words by using an expected morphological code.
  • String based approaches: The most common class of algorithms for modern texts compares words by their similarity at the letter level. Different approaches such as the Levenshtein algorithm (Ottmann et al. 1996) or more modern and faster approaches such as FastSS (Bocek et al. 2007) are used to compare a fragmentary word with all candidates.
  • Named Entity lists: With a focus on deletions of inscriptions, existing and extended named entity lists for person names, cities or demonyms like the Lexicon of Greek Personal Names (Fraser et al. 1987-2008) or the Wörterlisten of Dieter Hagedorn are used to look for names of persons and places and give them a higher probability.
  • Word properties: When focusing on Stoichedon texts, word length is a relevant property. For this reason the candidate list can be restricted by both exact length as well as by min-max thresholds.

From a global perspective, every word found in the vocabulary is a potential suggestion candidate. To reduce this list of anywhere from several hundred thousand to several million words to a more reasonable size, the results of all selected algorithms are combined into a normalised score between 0 and 1 (Kruse 2009). In the last working step of this process, the candidate list (ordered by score in descending order) is then provided to the user.
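
One simple way to obtain such a combined score is a weighted average of per-algorithm scores. The following Python sketch assumes that each algorithm already returns a score between 0 and 1 for the candidates it knows, and that a candidate missing from an algorithm's results counts as 0 for that algorithm. The actual combination scheme is described in Kruse 2009 and may differ; the function name combine_scores, the weights and the example values are hypothetical.

```python
def combine_scores(per_algorithm, weights=None):
    """Combine per-algorithm scores into one normalised score per candidate.

    per_algorithm maps an algorithm name to a dict {candidate: score in [0, 1]}.
    Assumption: the combination is a weighted average, which stays in [0, 1].
    """
    if weights is None:
        weights = {name: 1.0 for name in per_algorithm}
    total = sum(weights[name] for name in per_algorithm)
    candidates = set().union(*per_algorithm.values())
    combined = {
        c: sum(weights[name] * scores.get(c, 0.0)
               for name, scores in per_algorithm.items()) / total
        for c in candidates
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two of the algorithm classes.
ranking = combine_scores({
    "string": {"imperator": 1.0, "imperatum": 0.5},
    "bigram": {"imperator": 0.8, "senator": 0.4},
})
```

With equal weights, a candidate supported by several algorithms is ranked above one proposed by a single algorithm, which matches the idea of presenting the user with a list ordered by descending score.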

A demonstration video of the current implementation can be viewed online.

References

  • [Bocek et al. 2007] Bocek, T., Hunt, E., Stiller, B.: Fast Similarity Search in Large Dictionaries, 2007, Department of Informatics, University of Zurich.
  • [Bodard et al. 2009] Gabriel Bodard et al., EpiDoc Cheat Sheet: Krummrey-Panciera sigla & EpiDoc tags, 2006-2009. Version 1085. URL: http://epidoc.svn.sourceforge.net/viewvc/epidoc/trunk/guidelines/msword/cheatsheet.doc, last accessed: Nov. 10th, 2009.
  • [Bordag 2008] Stefan Bordag: A Comparison of Co-occurrence and Similarity Measures as Simulations of Context, 2008, In: CICLing Vol. 4919, Springer, 2008 (Lecture Notes in Computer Science).
  • [Buechler 2008] Büchler, M. Medusa: Performante Textstatistiken auf großen Textmengen: Kookkurrenzanalyse in Theorie und Anwendung, Vdm Verlag Dr. Müller, 2008. ISBN-10: 3639011252.
  • [Crane et al. 2009] Crane, G., Bamman, D. The Latin and Ancient Greek Dependency Treebanks, 2009. URL: http://nlp.perseus.tufts.edu/syntax/treebank/, last accessed: Nov. 10th, 2009.
  • [Firth 1957] Firth, J. R., A Synopsis of Linguistic Theory, 1957.
  • [Fraser et al. 1987-2008] Fraser, Peter M.; Matthews, E.; Osborne, Michael J. (1987-2008) (in Greek and English). A Lexicon of Greek Personal Names (Vol. 1-5, Suppl.), Oxford u.a. Clarendon Press. PPN: 01317276X.
  • [Heyer et al. 2008] Heyer, G., Quasthoff, U. and Wittig, T. Text Mining: Wissensrohstoff Text – Konzepte, Algorithmen, Ergebnisse. W3L-Verlag, 2nd edition, 2008.
  • [Kruse 2009] Kruse, Sebastian (2009) (in German). Textvervollständigung auf antiken Texten. University of Leipzig, Bachelor Thesis. pp 48-49. URL: http://www.eaqua.net/~skruse/bachelor, last accessed: Nov. 10th, 2009.
  • [Kukich 1992] K. Kukich: Techniques for Automatically Correcting Words in Text. 1992, ACM Computing Surveys 24, Nr. 4.
  • [Ottmann et al. 1996] T. Ottmann, P. Widmayer: Algorithmen und Datenstrukturen, Spektrum Verlag.
  • [Schierle et al. 2008] Martin Schierle, Sascha Schulz, Markus Ackermann: From Spelling Correction to Text Cleaning - Using Context Information, 2008, In: Data Analysis, Machine Learning and Applications: Proceedings of the 31st Annual Conference of the Gesellschaft für Klassifikation e.V., 2008.

ALL WELCOME

The seminar will be followed by wine and refreshments.

Audio recording of seminar (MP3)

Presentation (PDF)