The Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP)

Lead PIs

Co-PIs

Abstract

The textual production of the diverse premodern Islamicate cultures, stretching from modern Spain to South Asia, is among the most prolific in human history, making it an ideal candidate for digitally enhanced methods of search and analysis. Yet our ability to bring these literary traditions into the digital realm has been repeatedly frustrated by the poor performance of the currently available OCR solutions for Arabic-script languages. Several of these solutions are also prohibitively expensive for academic users and offer limited trainability. For this reason, by late 2016 OpenITI’s work had increasingly come to focus on developing improved open-source OCR for Persian and Arabic through a collaboration with Benjamin Kiessling, then a computer scientist at Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities (now at Université Paris Sciences et Lettres). This collaboration led to a series of studies on Kiessling’s new OCR engine, Kraken, which demonstrated that its neural network-based approach to OCR (now also implemented in the most recent release of Tesseract 4) could routinely achieve accuracy rates above 97% on Persian and Arabic print books and, at times, above 98%.

Further Description

Funding

The Andrew W. Mellon Foundation