The Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP)

Lead PIs

David Smith
Matthew Thomas Miller, University of Maryland, College Park
Sarah Bowen Savant, Aga Khan University – UK
Raffaele Viglianti, University of Maryland, College Park

Co-PIs

Alejandro Toselli
Jacob Murel
Ryan Muther
Şaban Ağalar, University of Maryland, College Park
Jonathan Parkes Allen, University of Maryland, College Park
John Mullan, University of Maryland, College Park
Mehdy Sedaghat Payam, University of Maryland, College Park
Masoumeh Seydi, University of Leipzig

Abstract

The textual production of the diverse premodern Islamicate cultures stretching from modern Spain to South Asia is one of the most prolific in human history, making it an ideal candidate for digitally enhanced methods of search and analysis. Yet our ability to bring these literary traditions into the digital realm has been repeatedly frustrated by the poor performance of currently available OCR solutions for Arabic-script languages. Several of these solutions are also prohibitively expensive for academic users and offer limited trainability. For these reasons, by late 2016 OpenITI’s work had increasingly come to focus on developing improved open-source OCR for Persian and Arabic through a collaboration with Benjamin Kiessling, then a computer scientist at Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities (now at Université Paris Sciences et Lettres). This collaboration led to a series of studies on Kiessling’s new OCR engine, Kraken, which demonstrated that its neural network-based approach to OCR (now also implemented in Tesseract 4’s most recent release) could routinely achieve accuracy rates greater than 97% on Persian and Arabic print books and, at times, greater than 98%.

Funding

The Andrew W. Mellon Foundation
