Crowdsourcing a digital library of pre-modern Chinese

Donald Sturgeon (Harvard University)

Digital Classicist London seminar 2017

Friday June 9th at 16:30, in room 234, Senate House, Malet Street, London WC1E 7HU

Livecast on the Digital Classicist London YouTube channel.

Traditional digital libraries, including those in the field of pre-modern Chinese, have typically followed top-down, centralized, and static models of content creation and curation. This is a natural and well-grounded strategy for database design and implementation: it has strong roots in traditional academic publishing models and offers clear technical advantages over alternative approaches. It cannot, however, adequately meet the challenges of increasingly large-scale digitization and the resulting rapid growth in available corpus size.

In this talk I present a working example of a dynamic alternative to the conventional static model. This alternative leverages a large, distributed community of users, many of whom may not be affiliated with mainstream academia, to curate material in a way that is distributed and scalable and that does not rely upon centralized editing. In the particular case presented, initial transcriptions of scanned pre-modern works are created automatically using specially developed OCR techniques and immediately published on an online open access digital library platform called the Chinese Text Project. The platform uses this data to implement full-text search, image search, full-text export and other features, while simultaneously facilitating correction of the initial OCR results by a geographically distributed group of pseudonymous volunteer users. It is currently used by around 25,000 individual users each day. User-submitted corrections are applied immediately to the publicly available, version-controlled transcriptions without prior review, but can easily be validated visually by other users through simple semi-automated mechanisms. This approach allows immediate access to a “long tail” of less popular and less mainstream material which would otherwise likely be overlooked for inclusion in this type of full-text database system. To date the procedure described has been applied to over 25 million pages of historical texts, including 5 million pages from the Harvard-Yenching Library collection, and the complete results have been published online.
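
As a rough illustration of the correction workflow described above, the following Python sketch models a version-controlled transcription in which corrections take effect immediately and are checked afterwards. It is not the Chinese Text Project's actual implementation: the class names, method names and the sample OCR error are all hypothetical, and character-level diffing with the standard library's difflib merely stands in for whatever validation mechanism the platform really uses.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Revision:
    """One saved state of a transcription (hypothetical data model)."""
    text: str
    editor: str
    validated: bool = False


@dataclass
class Transcription:
    """Append-only revision history; the latest revision is the public text."""
    revisions: list = field(default_factory=list)

    @classmethod
    def from_ocr(cls, ocr_text):
        # The raw OCR output is published at once as revision 0.
        return cls(revisions=[Revision(ocr_text, editor="ocr")])

    @property
    def current(self):
        return self.revisions[-1].text

    def submit_correction(self, new_text, editor):
        # Applied immediately, with no prior review.
        self.revisions.append(Revision(new_text, editor))

    def pending_review(self):
        # Indices of user revisions that nobody has validated yet.
        return [i for i, r in enumerate(self.revisions)
                if i > 0 and not r.validated]

    def validate(self, index):
        self.revisions[index].validated = True

    def revert(self, index):
        # History is never rewritten: a revert is just another revision.
        self.revisions.append(Revision(self.revisions[index].text, "revert"))


def changed_spans(old, new):
    """Return (operation, old_fragment, new_fragment) for each difference."""
    matcher = difflib.SequenceMatcher(None, old, new)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


if __name__ == "__main__":
    # Hypothetical OCR misreading of the final character.
    page = Transcription.from_ocr("學而時習乏")
    page.submit_correction("學而時習之", editor="volunteer_1")
    print(page.current)
    print(changed_spans(page.revisions[0].text, page.revisions[1].text))
    print(page.pending_review())
```

Because every change is stored as a new revision, a mistaken or malicious edit can be undone without losing history, which is what makes publishing corrections before review workable in a model of this kind.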

In addition to the online platform, the development of an open plugin system and API, which allow customization of the user interface with user-defined extensions and immediate machine-readable access to full-text data and metadata, has made possible many further use cases. These include efficient, distributed collaboration and integration with other web platforms, including projects based at Leiden University, Academia Sinica and elsewhere, as well as use in data mining, digital humanities research and teaching, and as a self-service tool for projects requiring the creation of proofread transcriptions of particular early texts. A Python library has also been created to further encourage use of the API; in the final part of the talk I explain how the API and this Python library are currently being used to facilitate, and greatly simplify, digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.