A data management student presents a research project titled "Solving Reverse Data Management Problems as Fast as Theoretically Possible" in a large auditorium

Data Management at Khoury College of Computer Sciences

Exploring ways to efficiently store, organize, and analyze data

Data management research is key to the technology that keeps the world going. Computer systems of all kinds rely on some form of data, and as computing extends into many more aspects of life, both the amount and complexity of data are growing exponentially. Consider, for example, that the latest iPhone can hold the equivalent of approximately 13,000 Mac Classic hard drives from just a generation ago.

Understanding how to manage these massive collections of data and how to make them usable is a formidable research challenge. That challenge is intensified by the constant evolution of data types, which now include social media feeds, sensor readings, mobility data, and a wide array of sources from personal devices to satellite observations of climate change. Ensuring the reliability of this data and developing effective strategies for its organization, cleaning, and analysis are vital for keeping the systems we have all come to depend on working.

Impacting a broad range of disciplines

The increase in the amount of data has created a pressing need for computer science research on data storage solutions capable of working with Big Data, including compression algorithms, storage systems, and new approaches to accessing that data. This research has a direct impact on making computer applications usable and efficient.

Data management research at Khoury College also has an impact on developing user-friendly interfaces for data, including visualizations and analysis tools that provide means for extracting meaningful insights. The increasing focus on “data-driven” decision-making in many fields means data management research is fundamental for supporting business.

Sample research areas

  • Data integration
  • Database systems
  • Database theory
  • Knowledge representation
  • Parallel & distributed data analysis
  • Data lakes
  • Search algorithms
  • Data science
  • Data curation and integration
  • Bias in datasets
  • Explainable AI for databases
  • Real-time data processing
A data management student (left) holds an award plaque while standing with Khoury faculty member Amal Ahmed (center) and Dean Beth Mynatt

Current project highlights

Any-k: Optimal ranked enumeration for dynamic programs

Any-k research has the potential to help computers find all possible solutions to a problem and return them in ranked order. This matters in data science because working efficiently with ranked results could lead to much faster searches and data analysis.
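To make the idea of ranked enumeration concrete, here is a minimal sketch in Python. It is not the Any-k algorithm itself, only a generic textbook-style illustration under simplifying assumptions: a solution picks one option from each stage, the cost of a solution is the sum of its option costs, and solutions are produced lazily in ascending cost order with a priority queue. The function name `ranked_enumeration` and the sample stages are hypothetical.

```python
import heapq
from itertools import islice


def ranked_enumeration(stages):
    """Yield (total_cost, labels) in ascending order of total cost.

    `stages` is a list of lists of (cost, label) pairs; a solution picks
    exactly one pair from every stage. Illustrative sketch only.
    """
    ordered = [sorted(stage) for stage in stages]
    if any(not stage for stage in ordered):
        return  # an empty stage means there are no solutions

    def total(indexes):
        return sum(ordered[s][i][0] for s, i in enumerate(indexes))

    start = (0,) * len(ordered)  # cheapest option in every stage
    heap = [(total(start), start)]
    seen = {start}
    while heap:
        cost, indexes = heapq.heappop(heap)
        yield cost, [ordered[s][i][1] for s, i in enumerate(indexes)]
        # Successors: move one stage to its next-cheapest option.
        for s in range(len(indexes)):
            if indexes[s] + 1 < len(ordered[s]):
                nxt = indexes[:s] + (indexes[s] + 1,) + indexes[s + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (total(nxt), nxt))


if __name__ == "__main__":
    # Hypothetical two-stage problem; ask only for the two best solutions.
    stages = [[(1, "a"), (4, "b")], [(2, "c"), (3, "d")]]
    for cost, labels in islice(ranked_enumeration(stages), 2):
        print(cost, labels)
```

Because the generator is lazy, a caller who needs only the first few answers does not pay for ranking all of them, which is the practical appeal of returning results in ranked order.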

Improved searching for data lakes

Khoury researchers have pioneered Starmie, a new approach to searching data lakes: large collections of datasets from sources such as business and government, each with its own underlying structure. Because these structures differ, data lakes are difficult to search. Starmie addresses this challenge by using contextualized representation learning to capture the meaning of data within columns, making data lakes more searchable.

Creating new tools for mobility data science

Modern applications create massive amounts of data about how people and things move, from the packages we order and the GPS that guides our cars to the fitness trackers that map a walk or run. This mobility data differs from the data in typical databases. Khoury researchers are collaborating on new tools designed specifically for it, and are investigating ways to preserve privacy in mobility datasets and to integrate them with AI approaches.

Unified reverse data management

What computational resources do we need to find a representation that is logically equivalent to the original but minimal in size? One concrete problem we are studying is finding the minimal-size factorization of the provenance of database queries.
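As a small illustration of what a smaller but logically equivalent representation looks like, the sketch below compares a flat provenance formula with a factorized one. The tuple variables `r1`, `s11`, `s12`, `t1`, and `t2` are hypothetical, and the brute-force equivalence check is only for demonstration on a tiny formula; it is not the method studied in the project.

```python
from itertools import product


def flat(r1, s11, s12, t1, t2):
    # Sum-of-products form: r1*s11*t1 + r1*s12*t2 (6 variable occurrences).
    return (r1 and s11 and t1) or (r1 and s12 and t2)


def factorized(r1, s11, s12, t1, t2):
    # r1 factored out: r1*(s11*t1 + s12*t2) (5 variable occurrences).
    return r1 and ((s11 and t1) or (s12 and t2))


def equivalent(f, g, arity):
    """Check logical equivalence by trying every truth assignment."""
    return all(
        bool(f(*bits)) == bool(g(*bits))
        for bits in product([False, True], repeat=arity)
    )


if __name__ == "__main__":
    print(equivalent(flat, factorized, 5))
```

Both formulas are true for exactly the same assignments, but the factorized form mentions `r1` once instead of twice. Finding the smallest such form for larger provenance formulas is the harder question the project poses.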

Relational diagrams for explaining relational query patterns

Comparing relational languages by their logical expressiveness is well understood. Less understood is how to compare relational languages by their ability to represent relational query patterns. To the best of our knowledge, we provide the first semantic definition of relational query patterns by using a variant of structure-preserving mappings between the relational tables of queries.

Recent research publications

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
Authors: Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, Renée Miller

This research addresses table union search, the problem of finding tables within data lakes that can be merged. It describes Starmie, a system that considers context to determine whether data columns can be combined and that has the potential to make data lakes much more searchable.

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V
Authors: Roee Shraga, Renée Miller

Khoury researchers are building an understanding of how datasets change over time, especially in situations where multiple people work on the same data. This paper introduces Explain-Da-V, a new framework that explains the actual meaning behind changes in datasets.

A Principled Approach for a New Bias Measure
Authors: Bruno Scarone, Alfredo Viola, Ricardo Baeza-Yates

As machine learning shapes decisions in various fields, bias in data can have serious consequences. Khoury College research proposes a new way to measure bias in datasets, along with a potential tool that policymakers could use to reduce it.

Faculty members

  • Ricardo Baeza-Yates

    Ricardo Baeza-Yates is a professor of the practice and the director of research at Northeastern’s Institute for Experiential AI. He has held leadership positions in tech companies on three continents, taught in Spain and Chile, and co-wrote the best-selling textbook Modern Information Retrieval — among more than 600 other publications.

  • Wolfgang Gatterbauer

    Wolfgang Gatterbauer is an associate professor at Khoury College. He works on the theory of scalable data management, with the goal of expanding data management systems and enabling them to support novel functionalities.

  • Mario Nascimento

    Mario Nascimento is a professor of the practice and the director of Pacific Northwest research at Khoury College. His research focuses on data science, specifically spatiotemporal databases.

  • Prashant Pandey

    Prashant Pandey is an assistant professor at Khoury College. He researches scalable data systems with robust theoretical foundations for efficient data management, and tackles every level of that challenge, from the theoretical aspects of data structures to the practical issues of scaling data systems.

  • Mirek Riedewald

    Mirek Riedewald is a professor at Khoury College. His research emphasizes the design of novel, scalable data management and analysis techniques, with applications in ornithology, physics, astronomy, and mechanical and aerospace engineering, among other fields.

  • Cheng Tan

    Cheng Tan is an assistant professor at Khoury College. His systems and security research focuses on building verifiable outsourced services and certified neural networks.