Analysis of Named Entity Recognition & Entity Linking in Historical Text – Kunal Asarsa – 8.12.16
Master’s Thesis Defense
Abstract
Named Entity Recognition is a Natural Language Processing (NLP) technique used to identify names of people, places, organizations, and other entities in a given piece of text. Entity Linking is a further NLP task that connects a machine-identified named entity to the corresponding entity in a knowledge base.
Both techniques have had a fair share of success with more recent content, where, for example, names have been linked to entities on Wikipedia (a process called “wikification”). However, parsers, models, and other NLP tools tend to behave differently on historical text. Our initial research found recurring issues such as names that do not exist in the models, words that are no longer used, and words that are now spelled differently. This study aims to minimize the effect of such differences by modifying the NLP processes and then comparing the manual tagging of a sample corpus against the tagging and linking performed by the machine.
With this study, we aim to identify the changes required in NLP tools, and/or the additional steps to include, in order to better handle historical text, and to compare the resulting performance with that of the unmodified tools. We hope that this study can provide not only statistical results but also findings that can help expedite and improve the process of handling historical text.
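As an illustration of the comparison between manual and machine tagging described above, the following is a minimal sketch, not the method used in the thesis. It assumes entities are represented as (start, end, label) tuples and that agreement is measured by exact-match precision, recall, and F1; the function name, the tuple layout, and the sample annotations are all hypothetical.

```python
def score_entities(gold, predicted):
    """Compare machine-tagged entities against manual (gold) annotations.

    Both arguments are iterables of (start, end, label) tuples; this
    representation is an assumption made for illustration only.
    Returns (precision, recall, f1) under exact-match scoring.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)  # entities tagged identically by both
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations for one passage: the machine gets the span of
# the second entity right but assigns it the wrong label.
gold = [(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 11, "ORGANIZATION")]
predicted = [(0, 2, "PERSON"), (5, 6, "ORGANIZATION"), (9, 11, "ORGANIZATION")]

precision, recall, f1 = score_entities(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same scoring on the output of a tool before and after modification would give one way to quantify whether a change helps on historical text.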
Advisor
David Smith