Table-as-Query: Unifying Data Discovery and Alignment
Mon 07.05.21
Fueled by advances in information extraction and by societal trends that value institutional openness and transparency, structured data are being produced and shared at an overwhelming speed. Open data sharing is central to supporting institutional transparency, but transparency is not achieved if shared data cannot be found and effectively aligned with other data being studied by data scientists, journalists, and others. This project will fundamentally contribute to the new science of open data sharing. The requirements for data discovery and integration over heterogeneous table repositories containing structured data are fundamentally different from those for federated data integration (where, for example, all data within an enterprise is integrated) or data exchange (where data is exchanged among a small set of autonomous peers, for example, between two institutions). This project will lay the theoretical foundations of data discovery (identification, alignment, and integration of tables) within table repositories. It will contribute both to developing the right conceptual framework for studying this problem and to designing systems that solve the table discovery and alignment problems at scale.
Today, solutions for data discovery over massive table repositories are in their infancy. Some solutions are tightly tied to a specific domain. For example, solutions for finding relevant tables in mass collaboration data (often called web tables) may assume that tables are designed for human consumption, with rich, human-readable attribute names or metadata, and are relatively small (being designed for display on web pages). Furthermore, solutions often assume that data scientists know a lot about what data is available and exactly how they want to integrate it with known data. These solutions let a user find tables that join with a query table on a specified attribute or that union with a query table. But they are inadequate if the best way to extend a query table is to join it on several attributes with two other tables and then union the extended result with an existing wider table. This project will develop a more holistic approach to table discovery that discovers both a set of alignable tables and the best way to integrate (or align) the new data with a query table. In this new paradigm, called “table-as-query”, the user does not need to know a priori on which attributes various tables in a repository are best aligned. This project promotes a research agenda under which discovery finds not a single table, but a set of tables that can be combined (aligned) with the query table. The solutions will include integration choices within the table discovery process, looking for a set of tables that can best be aligned with a query table and also finding what the best alignment is. Importantly, the project will not rely on the unique name assumption, which states that different values refer to different entities and that each value refers to a single entity. Real data contains synonyms (two values that refer to the same entity) and homographs (one value that refers to more than one entity). This project will define new foundations and mathematical principles for studying table alignment and discovery.
The search space is massive, so the project will also develop approximate, scalable solutions that can quickly (at interactive speeds) find a good set of tables and good alignments over massive table repositories with millions of tables.
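To make the motivating case concrete, the following is a minimal sketch of the alignment described above: a query table is joined on several attributes with two other tables, and the extended result is then unioned with a wider table. It is an illustration only, not the project's method. The tables, attribute names, values, and the helper functions `left_join` and `union` are all hypothetical, and the join attributes are hard-coded here, whereas under table-as-query the system would have to discover both the tables and the attributes.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]
Table = List[Row]


def left_join(left: Table, right: Table, on: Sequence[str]) -> Table:
    """Left-join two tables (lists of dict rows) on several attributes."""
    # Index the right table by the tuple of join-attribute values.
    index: Dict[tuple, List[Row]] = {}
    for row in right:
        index.setdefault(tuple(row[a] for a in on), []).append(row)
    joined: Table = []
    for row in left:
        # Rows without a match are kept unchanged (joined with an empty row).
        for match in index.get(tuple(row[a] for a in on), [{}]):
            joined.append({**row, **match})
    return joined


def union(top: Table, bottom: Table) -> Table:
    """Outer union: stack rows; attributes missing from a row become None."""
    columns: List[str] = []
    for row in top + bottom:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [{c: row.get(c) for c in columns} for row in top + bottom]


# Toy data (made up for illustration).
query = [{"city": "A", "year": 2020}, {"city": "B", "year": 2020}]
population = [
    {"city": "A", "year": 2020, "population": 10},
    {"city": "B", "year": 2020, "population": 20},
]
budget = [{"city": "A", "year": 2020, "budget": 5}]
wider = [{"city": "C", "year": 2020, "population": 30, "budget": 7}]

keys = ["city", "year"]  # assumed known here; discovery would have to find them
extended = left_join(left_join(query, population, keys), budget, keys)
result = union(extended, wider)

for row in result:
    print(row)
```

The sketch uses exact value matching on the join attributes, which is precisely the unique name assumption the project intends to drop: synonyms would fail to match and homographs would match incorrectly.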