An Integrated System For Generating And Correcting Polytonic Greek OCR

Federico Boschetti (CNR, Pisa) and Bruce Robertson (Mount Allison University, Canada)

Digital Classicist London & Institute of Classical Studies seminar 2013

Friday July 19th at 16:30, in Room S264, Senate House, Malet Street, London WC1E 7HU

Video recording of seminar (MP4)

Audio recording of seminar (MP3)

Presentation (PDF)

In many fields, the digital books revolution provides wide and highly detailed access to pertinent texts; but this revolution has left behind scholars working with ancient Greek. While it is true that Hellenists have had digitized canonical texts for many years, these collections' relatively limited scope and restrictive licenses are increasingly at odds with recent currents in computer-based humanities research: linked data, large-scale text mining, and syntactic treebanking, to name a few. Perhaps the most important impediments to digitizing polytonic Greek have been the lack of a high-quality optical character recognition system for this script, especially under an open-source license, and of an assisted editor for correcting polytonic Greek OCR output. In this seminar, we present an integrated system that fills these critical gaps, making it possible for polytonic Greek texts to be digitized en masse.

Rigaudon OCR is a complete suite of scripts, Python code, and data for producing polytonic Greek OCR. It comprises an OCR engine based on Gamera, with many features specific to the recognition of polytonic Greek and dedicated classifiers for the characters of Teubner, Teubner sans-serif, OCT/Loeb, and Didot editions. It includes an automatic spellchecker designed to correct Greek OCR errors, and it has a process for combining existing, high-quality Latin-script OCR output with parallel Greek output, as illustrated by this papyrological text. Finally, it coordinates these steps through the Sun Grid Engine scripts required to queue and parallelize them.
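The merging of Latin-script and Greek OCR output could be sketched roughly as follows. This is a minimal illustration, not Rigaudon's actual code: it assumes the two engines' token streams have already been aligned one-to-one, and simply keeps, for each token, the reading from the engine specialised for that token's script, detected via Unicode character names.

```python
import unicodedata

def script_of(token):
    """Classify a token as 'greek', 'latin', or 'other' by majority
    Unicode script of its alphabetic characters."""
    greek = latin = 0
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if "GREEK" in name:
                greek += 1
            elif "LATIN" in name:
                latin += 1
    if greek > latin:
        return "greek"
    if latin > greek:
        return "latin"
    return "other"

def merge_outputs(latin_tokens, greek_tokens):
    """For each aligned token pair, trust the Greek engine for
    Greek-script tokens and the Latin engine otherwise
    (hypothetical 1:1 alignment for illustration)."""
    merged = []
    for lat, grk in zip(latin_tokens, greek_tokens):
        merged.append(grk if script_of(grk) == "greek" else lat)
    return merged
```

For example, `merge_outputs(["In", "xyz", "legimus"], ["Iu", "λόγος", "lcgimus"])` keeps the Latin engine's `"In"` and `"legimus"` but the Greek engine's `"λόγος"`.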

The output provided by Rigaudon OCR is post-processed and piped to CoPhi, the proof-reading web application. There, each token is spell-checked and errors are classified as one of the following: an accent error; a well-formed syllabic sequence not in the dictionary; or a malformed character sequence. Suggestions provided by the spellchecker are added. If another edition of the same text is available, the OCR output is aligned to it, in order to reinforce the score of a low-ranked suggestion or to add a new suggestion at the top of the list.
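The three-way classification could be sketched as follows. Everything here is illustrative: the miniature `LEXICON` and the crude phonotactic test in `is_well_formed` are stand-ins for CoPhi's real dictionary and syllable rules. An accent error is detected by stripping diacritics (Unicode NFD decomposition) and matching the bare letters against the bare forms of the lexicon.

```python
import unicodedata

# Hypothetical miniature lexicon of correctly accented forms.
LEXICON = {"λόγος", "ἄνθρωπος", "καί"}

def strip_diacritics(word):
    """Drop accents and breathings by NFD-decomposing and
    removing combining marks."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(c))

BARE_LEXICON = {strip_diacritics(w) for w in LEXICON}
GREEK_VOWELS = set("αεηιουω")

def is_well_formed(word):
    """Toy plausibility check: at least one vowel and no run of
    more than three consonants (illustrative only)."""
    bare = strip_diacritics(word).lower()
    if not any(c in GREEK_VOWELS for c in bare):
        return False
    run = 0
    for c in bare:
        run = 0 if c in GREEK_VOWELS else run + 1
        if run > 3:
            return False
    return True

def classify(word):
    """Classify an OCR token per the three error classes above."""
    if word in LEXICON:
        return "correct"
    if strip_diacritics(word) in BARE_LEXICON:
        return "accent error"
    if is_well_formed(word):
        return "well-formed, not in dictionary"
    return "malformed"
```

Under this sketch, `classify("λογός")` is an accent error (the bare letters match a lexicon entry), while a vowelless jumble like `"βγκτρ"` is malformed.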

CoPhi colors errors according to their classification, in order to draw the proof-readers' attention. The proof-reading web application loads and stores documents in a centralized repository. By aligning the original OCR output with the corrected document, error patterns are extracted, which are useful for improving the training sets that will be applied to new documents.
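One plausible way to extract such error patterns (not necessarily the project's own method) is a character-level alignment of raw OCR against the proof-read text, counting each non-matching span as a substitution pattern:

```python
from collections import Counter
from difflib import SequenceMatcher

def error_patterns(ocr, corrected):
    """Align raw OCR against the proof-read text and count
    substitution patterns (what was read -> what it should be)."""
    patterns = Counter()
    sm = SequenceMatcher(None, ocr, corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            patterns[(ocr[i1:i2], corrected[j1:j2])] += 1
    return patterns
```

For instance, if the engine read a Latin `o` where the text has `ό`, `error_patterns("λoγος", "λόγος")` yields the pattern `("o", "ό")`, which can then inform retraining.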

Over the last two months, we have used this system to generate raw OCR of dozens of texts from archive.org and have experimented with hundreds more. A representative example achieved an overall aligned OCR accuracy score of 95% in our latest self-evaluation. Aiming for an industry-standard 98% accuracy, we plan to use this system to train the Tesseract and OCRopus engines in Greek OCR as well.
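One simple reading of an aligned character accuracy score (the project's exact metric may differ) is the fraction of ground-truth characters recovered after aligning the OCR output against the reference:

```python
from difflib import SequenceMatcher

def aligned_accuracy(ocr, ground_truth):
    """Character-level accuracy after alignment: matched characters
    divided by ground-truth length (one plausible interpretation of
    'aligned OCR accuracy')."""
    sm = SequenceMatcher(None, ocr, ground_truth, autojunk=False)
    matched = sum(size for _, _, size in sm.get_matching_blocks())
    return matched / len(ground_truth)
```

So an OCR reading of `"λoγος"` against ground truth `"λόγος"` scores 4/5 = 0.8 under this definition.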

ALL WELCOME

The seminar will be followed by wine and refreshments.