Our diagram research began in the 1980s with two important components: a strategy for parsing diagrams and spatial indexes that allow fast evaluation of the spatial relations needed for parsing, such as near, above, aligned, etc. The culmination of much of the work of this period was the PhD dissertation of Nikos Nikolakis. A brief description of the work appears in our 1995 paper (PDF) as well as in our demo site, which walks you through our system. There was an unfortunate hiatus in my research for a few years starting in 1995, and when the research resumed it was difficult, for various technical reasons, to pick up the threads of Nikolakis' work. Progress returned in 1998, producing publications on a variety of aspects of diagrams as well as on our natural language research. (See our Papers page.) Another PhD student, Mingyan Shao, finished her dissertation on machine learning for diagram classification, using both supervised and unsupervised learning techniques.
Our current diagram research has three components:
1. A complete redevelopment of our diagram software system in Java, with persistence to a relational database (currently Apache Derby); a minimal persistence sketch follows this list.
2. Partial parsing for machine learning.
3. Vectorization of raster images of diagrams.
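As a rough illustration of the persistence side of #1, the sketch below stores parsed graphic primitives in an embedded Apache Derby database through plain JDBC. The table layout, class name, and values are illustrative assumptions, not the schema of our actual system; it assumes the Derby jar is on the classpath and a fresh database directory.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

/** Minimal sketch: persisting diagram elements to an embedded Derby database. */
public class DerbyPersistenceSketch {

    public static void main(String[] args) throws SQLException {
        // ";create=true" creates the database directory on first use.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:derby:diagramsDB;create=true")) {

            try (Statement st = conn.createStatement()) {
                // Hypothetical table for primitives extracted from one diagram.
                st.executeUpdate(
                    "CREATE TABLE DIAGRAM_ELEMENTS (" +
                    "  ID INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY," +
                    "  PAPER_ID VARCHAR(64)," +
                    "  KIND VARCHAR(16)," +          // e.g. LINE, CURVE, FILL, TEXT
                    "  X1 DOUBLE, Y1 DOUBLE, X2 DOUBLE, Y2 DOUBLE)");
            }

            String sql = "INSERT INTO DIAGRAM_ELEMENTS " +
                         "(PAPER_ID, KIND, X1, Y1, X2, Y2) VALUES (?, ?, ?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "bmc-000123");  // illustrative paper identifier
                ps.setString(2, "LINE");
                ps.setDouble(3, 10.0);  ps.setDouble(4, 20.0);
                ps.setDouble(5, 110.0); ps.setDouble(6, 20.0);
                ps.executeUpdate();
            }
        }
    }
}
```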
For #1, we are starting with diagrams available in PDFs from our BioMed Central corpus, specifically those in vector format, represented by discrete drawing instructions such as moveTo(), lineTo(), stroke(), fill(), etc.
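To make that kind of input concrete, here is a small sketch of how a stream of path-construction instructions recovered from a PDF might be modeled before parsing. The class and enum names are illustrative assumptions, not the classes of our actual system.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: modeling vector-drawing instructions recovered from a PDF content stream. */
public class PathOpSketch {

    enum OpKind { MOVE_TO, LINE_TO, CURVE_TO, STROKE, FILL }

    /** One drawing instruction with its coordinate operands (empty for stroke/fill). */
    static final class PathOp {
        final OpKind kind;
        final double[] coords;
        PathOp(OpKind kind, double... coords) { this.kind = kind; this.coords = coords; }
    }

    public static void main(String[] args) {
        // A tiny instruction stream: one stroked horizontal line segment.
        List<PathOp> ops = new ArrayList<>();
        ops.add(new PathOp(OpKind.MOVE_TO, 10.0, 20.0));
        ops.add(new PathOp(OpKind.LINE_TO, 110.0, 20.0));
        ops.add(new PathOp(OpKind.STROKE));

        // Walk the stream and report line segments; a full parser would build
        // geometric objects here and hand them to the spatial index.
        double[] current = null;
        for (PathOp op : ops) {
            switch (op.kind) {
                case MOVE_TO:
                    current = op.coords;
                    break;
                case LINE_TO:
                    System.out.printf("segment (%.1f,%.1f) -> (%.1f,%.1f)%n",
                        current[0], current[1], op.coords[0], op.coords[1]);
                    current = op.coords;
                    break;
                default:
                    break; // CURVE_TO, STROKE, FILL handling omitted in this sketch
            }
        }
    }
}
```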
For #2, we accumulate statistics over basic elements and simple patterns and use them for diagram classification. The goal is straightforward: once diagrams can be classified, we can build retrieval systems that focus on specific types.
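A much-reduced illustration of this idea appears below: count element types, normalize the counts into a feature vector, and hand the vector to a supervised or unsupervised learner. The feature names and counts are made up for the example; they are not our actual feature set.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: turning counts of basic elements into a feature vector for classification. */
public class DiagramFeatureSketch {

    public static void main(String[] args) {
        // Illustrative counts accumulated while scanning one diagram's primitives.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("horizontalLines", 14);
        counts.put("verticalLines", 12);
        counts.put("curves", 3);
        counts.put("filledRegions", 2);
        counts.put("textBlocks", 9);

        // Normalize so diagrams of different sizes are comparable; the resulting
        // vector is what a classifier would consume.
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double[] features = counts.values().stream()
                                  .mapToDouble(c -> c.doubleValue() / total)
                                  .toArray();

        System.out.println(Arrays.toString(features));
    }
}
```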
For #3, we recognize that the majority of figures in the literature are not available in vector format, only as raster images, e.g., a 300 x 400 pixel JPEG obtained from the HTML form of a paper. From those 120,000 pixels, the vectorization software we are developing has to extract elements such as lines, curves, filled regions, and text characters. Our first work on vectorization was published as Moment-derived Object Models for Vectorization (2005); our current approach is modular and quite different from that paper's.
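As a very small illustration of the first stage such a pipeline needs, a raster figure can be reduced to a binary foreground mask before lines, curves, and characters are traced. This is only a sketch: the file name and the fixed threshold are assumptions, and it is not the modular system described above.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Sketch: binarizing a raster figure as a first step toward vectorization. */
public class BinarizeSketch {

    public static void main(String[] args) throws IOException {
        BufferedImage img = ImageIO.read(new File("figure.jpg")); // e.g. 300 x 400 pixels

        boolean[][] foreground = new boolean[img.getHeight()][img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                // Average the R, G, B channels to get a gray value in [0, 255].
                int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                foreground[y][x] = gray < 128; // dark pixels are treated as ink
            }
        }

        long inkPixels = 0;
        for (boolean[] row : foreground)
            for (boolean b : row) if (b) inkPixels++;
        System.out.println("foreground pixels: " + inkPixels);
        // Later stages would group these pixels into lines, curves,
        // filled regions, and text characters.
    }
}
```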
As we have said, figures and text work together to tell the whole story. Our current NLP research has a significant component that looks at figure captions as well as discussions of figures in the text proper.