![an image showing white dots on top of a blue and green background](https://s32615.pcdn.co/wp-content/uploads/2024/02/default-hero.jpg)
The Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP)
Thu 08.26.21
The textual production of the diverse premodern Islamicate cultures stretching from modern Spain to South Asia is among the most prolific in human history, making it an ideal candidate for digitally enhanced methods of search and analysis. Yet our ability to bring these literary traditions into the digital realm has been repeatedly frustrated by the poor performance of currently available OCR solutions for Arabic-script languages. Several of these solutions are also prohibitively expensive for academic users and offer limited trainability. For these reasons, by late 2016 OpenITI’s work began to focus increasingly on developing improved open-source OCR for Persian and Arabic through a collaboration with Benjamin Kiessling, then a computer scientist at Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities (now at Université Paris Sciences et Lettres). This collaboration led to a series of studies on Kiessling’s new OCR engine, Kraken, which demonstrated that its neural network-based approach to OCR (now also implemented in the most recent release of Tesseract 4) could routinely achieve accuracy rates greater than 97% on Persian and Arabic print books and, at times, greater than 98%.