A data management student presents a research project titled "Solving Reverse Data Management Problems as Fast as Theoretically Possible" in a large auditorium

Data Management at Khoury College of Computer Sciences

Exploring ways to efficiently store, organize, and analyze data

Data management research is key to the technology that keeps the world going. Computer systems of all kinds rely on some form of data, and as computing extends into many more aspects of life, both the amount and complexity of data are growing exponentially. Consider, for example, that the latest iPhone can hold the equivalent of approximately 13,000 Mac Classic hard drives from just a generation ago.

Managing these massive collections of data and making them usable are formidable research challenges. These challenges are intensified by the constant evolution of data types, which now include social media feeds, sensor readings, mobility data, and a wide array of sources from personal devices to satellite observations of climate change. Ensuring the reliability of this data and developing effective strategies for its organization, cleaning, and analysis are vital to keeping the systems we have all come to depend on running.


A whiteboard with different mathematical variables and notes written in blue and black

A Khoury data management student types on a computer keyboard in a college lab. A whiteboard filled with mathematical equations and notes is behind the student.

Impacting a broad range of disciplines

The increase in the amount of data has created a pressing need for computer science research on data storage solutions capable of working with Big Data, including compression algorithms, storage systems, and new approaches to accessing data. This research has a direct impact on making computer applications usable and efficient.

Data management research at Khoury College also has an impact on developing user-friendly interfaces for data, including visualizations and analysis tools that provide means for extracting meaningful insights. The increasing focus on “data-driven” decision-making in many fields means data management research is fundamental for supporting business.

Sample research areas

Data integration
Database systems
Database theory
Knowledge representation
Parallel & distributed data analysis
Data lakes
Search algorithms
Data science
Data curation and integration
Bias in datasets
Explainable AI for databases
Real-time data processing

A data management student (left) holds an award plaque while standing with Khoury faculty member Amal Ahmed (center) and Dean Beth Mynatt

Three students sitting in a computer lab discuss a project. One student is holding a stack of papers in her hand.

Current project highlights

Any-k: Optimal ranked enumeration for dynamic programs

Improved searching for data lakes

Khoury researchers have pioneered Starmie, a new approach to searching data lakes: large collections of datasets from sources such as business and government, each with its own underlying structure. That variety of structure makes data lakes difficult to search. Starmie addresses this challenge by learning context-aware representations of columns that capture the meaning of the data within them, making data lakes more searchable.

Creating new tools for mobility data science

Modern applications create massive amounts of data about how people and things move, from the packages we order to the GPS that guides our cars to the fitness trackers that map a walk or run. This mobility data differs from the data found in typical databases; Khoury researchers are collaborating on new tools designed specifically for it, as well as investigating ways to preserve privacy in mobility datasets and to integrate them with AI approaches.

Unified reverse data management

Relational diagrams for explaining relational query patterns

Comparing relational languages by their logical expressiveness is well understood. Less understood is how to compare relational languages by their ability to represent relational query patterns. To the best of our knowledge, we provide the first semantic definition of relational query patterns by using a variant of structure-preserving mappings between the relational tables of queries.


Recent research publications

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
Authors: Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, Renée Miller

This research addresses table union search, the task of finding tables in a data lake that can be merged with a given table. It describes a new approach, Starmie, a system that considers context to determine whether data columns can be combined, and that has the potential to make data lakes much more searchable.
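To make the idea of table union search more concrete, here is a minimal sketch in Python. It is not Starmie's method: Starmie learns column representations that take context into account, whereas this toy version simply treats each column as a bag of lowercase tokens and compares columns with cosine similarity. The function names (`column_vector`, `unionability`, `union_search`) and the sample tables are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def column_vector(values):
    """Represent a column as a bag of lowercase tokens.

    This is a simple stand-in for a learned column embedding.
    """
    tokens = []
    for value in values:
        tokens.extend(str(value).lower().split())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def unionability(query, candidate):
    """Average, over query columns, of the best match among candidate columns."""
    if not query or not candidate:
        return 0.0
    candidate_vectors = [column_vector(col) for col in candidate.values()]
    total = 0.0
    for col in query.values():
        query_vector = column_vector(col)
        total += max(cosine(query_vector, cv) for cv in candidate_vectors)
    return total / len(query)


def union_search(query, lake, k=1):
    """Return the names of the k tables in the lake that score highest."""
    scored = [(unionability(query, table), name) for name, table in lake.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical tables: each table maps a column name to its list of values.
query = {"city": ["Boston", "Seattle"], "state": ["MA", "WA"]}
lake = {
    "us_cities": {"town": ["Boston", "Portland"], "st": ["MA", "OR"]},
    "fruit_prices": {"fruit": ["apple", "pear"], "price": ["1.20", "0.90"]},
}
top = union_search(query, lake, k=1)
print(top)
```

Note that the column names differ between the query and the `us_cities` table, so the match is driven entirely by the values. A token-overlap measure like this one misses columns that mean the same thing but share no values, which is the kind of gap that context-aware column representations are meant to close.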

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V
Authors: Roee Shraga, Renée Miller

Khoury researchers are building an understanding of how datasets change over time, especially in situations where multiple people work on the same data. This paper introduces Explain-Da-V, a new framework that explains the actual meaning behind changes in datasets.

A Principled Approach for a New Bias Measure
Authors: Bruno Scarone, Alfredo Viola, Ricardo Baeza-Yates

As machine learning shapes decisions in various fields, bias in data can have serious consequences. Khoury College research proposes a new way to measure bias in datasets, along with a potential tool that policymakers could use to reduce it.

Related labs and groups

Data Lab @ Northeastern

Faculty members

Ricardo Baeza-Yates

Ricardo Baeza-Yates is a professor of the practice and the director of research at Northeastern’s Institute for Experiential AI. He has held leadership positions in tech companies on three continents, taught in Spain and Chile, and co-wrote the best-selling textbook Modern Information Retrieval — among more than 600 other publications.
Read bio
Wolfgang Gatterbauer

Wolfgang Gatterbauer is an associate professor at Khoury College. He works on the theory of scalable data management, with the goal of expanding data management systems and enabling them to support novel functionalities.
Read bio
Mario Nascimento

Mario Nascimento is a professor of the practice and the director of Pacific Northwest research at Khoury College. His research focuses on data science, specifically spatiotemporal databases.
Read bio
Prashant Pandey

Prashant Pandey is an assistant professor at Khoury College. He researches scalable data systems with robust theoretical foundations for efficient data management, and tackles every level of that challenge, from the theoretical aspects of data structures to the practical issues of scaling data systems.
Read bio
Mirek Riedewald

Mirek Riedewald is a professor at Khoury College. His research emphasizes the design of novel, scalable data management and analysis techniques, with applications in ornithology, physics, astronomy, and mechanical and aerospace engineering, among other fields.
Read bio
Cheng Tan

Cheng Tan is an assistant professor at Khoury College. His systems and security research focuses on building verifiable outsourced services and certified neural networks.
Read bio