The Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP)
Lead PIs
- David Smith
- Dr. Matthew Thomas Miller, University of Maryland, College Park
- Sarah Bowen Savant, Aga Khan University – UK
- Raffaele Viglianti, University of Maryland, College Park
Co-PIs
- Alejandro Toselli
- Jacob Murel
- Ryan Muther
- Şaban Ağalar, University of Maryland, College Park
- Jonathan Parkes Allen, University of Maryland, College Park
- John Mullan, University of Maryland, College Park
- Mehdy Sedaghat Payam, University of Maryland, College Park
- Masoumeh Seydi, University of Leipzig
Abstract
The textual production of the diverse premodern Islamicate cultures stretching from modern Spain to South Asia is one of the most prolific in human history, making it an ideal candidate for digitally enhanced methods of search and analysis. Yet our ability to bring these diverse literary traditions into the digital realm has been repeatedly frustrated by the underperformance of currently available OCR solutions for Arabic-script languages. Beyond their poor performance, several of these solutions are also prohibitively expensive for academic users and offer limited trainability. For this reason, by late 2016 OpenITI’s work had increasingly begun to focus on developing improved open-source OCR for Persian and Arabic through a collaboration with Benjamin Kiessling, then a computer scientist at Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities (now at Université Paris Sciences et Lettres). This collaboration led to a series of studies on Kiessling’s new OCR engine, Kraken, which demonstrated that its neural network-based approach to OCR (now also implemented in the most recent release of Tesseract 4) could routinely achieve accuracy rates greater than 97%, and at times greater than 98%, on Persian and Arabic print books.