Data Science at Khoury College of Computer Sciences

Understanding how we collect, organize, and make sense of data to create knowledge and support human decision-making

Big data is all around us. Extremely large and complex datasets have become a backbone of our digital life — underpinning transportation and navigation systems, detecting financial fraud, and advancing scientific research in biology, among many other spheres.

Data science is the interdisciplinary study of all aspects of data: collecting it, managing it, storing and retrieving it, analyzing it, and building systems that can mine valuable patterns and knowledge, leading to applications that help run our world. Khoury College researchers are breaking new ground in both theory and systems in data science. Together, these areas ensure that data can be efficiently managed and utilized, supporting advancements in artificial intelligence, industry, scientific research, health care, and the humanities.

Changing how we think of data

Data science’s impact on society is all around us, and its role is growing rapidly. It has revolutionized decision-making in health care and drug discovery. Businesses depend on data science to optimize operations, manage supply chains, and keep transactions secure. Scientific research has also benefited from data science, which enables faster discoveries and deeper understanding in fields like genomics, astronomy, and environmental science.

Data science research has also changed how we think of data: It is now a valuable asset in its own right, with possibilities that go beyond any one purpose or application. The ability to collect, analyze, and interpret vast amounts of information has opened new avenues for innovation and problem-solving generally, crossing disciplines and bridging boundaries.

Current research areas

  • Business and predictive analytics
  • Computational epidemiology
  • Computational molecular biology and bioinformatics
  • Computational social science
  • Computer vision
  • Data mining
  • Database systems
  • Database theory
  • Digital humanities
  • Game analytics
  • Health informatics
  • Information retrieval
  • Information visualization
  • Knowledge representation
  • Machine learning
  • Natural language processing
  • Parallel and distributed data analysis
  • Statistics

Domains of interest

  • Developing asymptotically optimal algorithms for query evaluation and reverse data management
  • Developing asymptotically optimal algorithms for compressed knowledge representation
  • Developing visual representations of relational queries

Current project highlights

Any-k: Optimal ranked enumeration for dynamic programs

Any-k research has the potential to help computers find all possible solutions to a problem and return them in ranked order. This matters in data science because working efficiently with ranked results could make searches and data analysis much more efficient.
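To make the idea of ranked enumeration concrete, here is a minimal sketch in Python. It is not the Any-k algorithm itself. It assumes a simple stand-in problem (listing every path through a small directed acyclic graph from cheapest to most expensive), and the names `ranked_paths`, `best_to_go`, and `example_edges` are hypothetical. A dynamic program first computes the cheapest remaining cost from each node, and a priority queue then releases complete paths in order of total cost.

```python
import heapq
from functools import lru_cache
from itertools import islice


def ranked_paths(edges, source, target):
    """Lazily yield (cost, path) for every source-to-target path, cheapest first.

    Assumes `edges` describes a directed acyclic graph as
    {node: [(next_node, weight), ...]} with non-negative weights.
    """
    infinity = float("inf")

    @lru_cache(maxsize=None)
    def best_to_go(node):
        # Dynamic program: cheapest cost from `node` to `target`.
        if node == target:
            return 0.0
        return min(
            (weight + best_to_go(nxt) for nxt, weight in edges.get(node, [])),
            default=infinity,
        )

    if best_to_go(source) == infinity:
        return  # no path exists

    # Queue entries: (cost so far + cheapest completion, cost so far, path).
    queue = [(best_to_go(source), 0.0, (source,))]
    while queue:
        _, cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == target:
            yield cost, list(path)
            continue
        for nxt, weight in edges.get(node, []):
            rest = best_to_go(nxt)
            if rest != infinity:  # skip dead ends
                new_cost = cost + weight
                heapq.heappush(queue, (new_cost + rest, new_cost, path + (nxt,)))


# Hypothetical example graph.
example_edges = {
    "s": [("a", 1), ("b", 4)],
    "a": [("t", 5), ("b", 1)],
    "b": [("t", 1)],
}

# Ask for only the two cheapest paths; the rest are never computed.
top_two = list(islice(ranked_paths(example_edges, "s", "t"), 2))
print(top_two)
```

Because the generator is lazy, a caller can stop after any number of results, which is the sense in which the best answers arrive first.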

May Institute on Computation and Statistics for Spectrometry and Proteomics

Northeastern’s Barnett Institute for Chemical and Biological Analysis sponsors the national May Institute in computation and statistics for mass spectrometry and proteomics.

Machine Learning Approaches Towards Risk Assessment and Prediction of Adverse Pregnancy Outcomes

This research explores which molecular, clinical, and genetic factors increase the risk of adverse pregnancy outcomes. Using large datasets from pregnant women and the power of machine learning, this research has the potential to make a direct impact on maternal health.

Recent research publications

A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations
Authors: Neha Makhija, Wolfgang Gatterbauer

This research introduces a new method using Integer Linear Programming to solve the problem of finding the smallest set of data to remove from a database to eliminate specific query results. This method can be applied to a broader range of database queries than previous methods and, in some cases, works faster.
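The problem described above can be illustrated with a small sketch. It deliberately uses brute-force search rather than the Integer Linear Programming approach of the paper, so it shows only what is being minimized, not how the authors solve it. It assumes each query result is represented by a "witness," meaning the set of database tuples that together produce that result; the function name `smallest_removal_set` and the example data are hypothetical.

```python
from itertools import combinations


def smallest_removal_set(witnesses):
    """Return a smallest set of tuples that intersects every witness.

    Each witness is a set of tuple identifiers that jointly produce one
    query result; deleting any one of them eliminates that result.
    Brute force: try candidate sets in order of increasing size.
    """
    all_tuples = sorted(set().union(*witnesses))
    for size in range(len(all_tuples) + 1):
        for candidate in combinations(all_tuples, size):
            chosen = set(candidate)
            if all(chosen & witness for witness in witnesses):
                return chosen
    raise ValueError("a witness with no tuples cannot be eliminated")


# Hypothetical example: a join of R(x, y) with S(y, z) has four results,
# each produced by one R tuple together with one S tuple.
example_witnesses = [
    {"R(1,2)", "S(2,4)"},
    {"R(1,2)", "S(2,5)"},
    {"R(3,2)", "S(2,4)"},
    {"R(3,2)", "S(2,5)"},
]

removal = smallest_removal_set(example_witnesses)
print(sorted(removal))  # ['R(1,2)', 'R(3,2)']
```

The search time grows exponentially with the number of tuples, which is why this sketch is suitable only for toy inputs.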

On the Reasonable Effectiveness of Relational Diagrams: Explaining Relational Query Patterns and the Pattern Expressiveness of Relational Languages
Authors: Wolfgang Gatterbauer, Cody Dunne

This research introduces a new way to define and compare query patterns across different programming languages, leading to the development of Relational Diagrams, a visual tool that helps users understand and write database queries faster and more accurately.

Faculty members

  • Javed Aslam

    Javed Aslam is a professor at Khoury College. His research emphasizes machine learning and information retrieval, with forays into human computation, transportation, computer security, wireless networking, and medical informatics.

  • Ricardo Baeza-Yates

    Ricardo Baeza-Yates is a professor of the practice and the director of research at Northeastern’s Institute for Experiential AI. He has held leadership positions in tech companies on three continents, taught in Spain and Chile, and co-wrote the best-selling textbook Modern Information Retrieval — among more than 600 other publications.

  • Albert-László Barabási

    Albert-László Barabási is the Robert Gray Dodge Professor of Network Science and a Distinguished University Professor at Northeastern University, director of the Center for Complex Network Research, and a joint appointee within Khoury College and the College of Science. His award-winning work includes the discovery of scale-free networks and the Barabási-Albert model to explain their prevalence in natural, technological, and social systems.

  • Usama Fayyad

    Usama Fayyad is a professor of the practice at Khoury College and the director of Northeastern’s Institute for Experiential AI. A recipient of awards from the ACM and NASA, he specializes in data science, machine learning, AI, and data mining.

  • Miguel Fuentes-Cabrera

    Miguel Fuentes-Cabrera is an associate teaching professor in the Khoury College of Computer Sciences at Northeastern University, based in Oakland. Before joining Khoury College in 2023, Fuentes-Cabrera built his research career at the Oak Ridge National Laboratory in Tennessee, where he used computational physics techniques to investigate nanomaterials.

  • Wolfgang Gatterbauer

    Wolfgang Gatterbauer is an associate professor at Khoury College. He works on the theory of scalable data management, with the goal of expanding data management systems and enabling them to support novel functionalities.

  • Eric Gerber

    Eric Gerber is an assistant teaching professor at Khoury College. His data science and statistics research examines sports topics, including minor league baseball prospects.

  • Fatemeh Ghoreishi

    Fatemeh Ghoreishi is an assistant professor at Khoury College, jointly appointed with the College of Engineering. Her research examines machine learning and Bayesian statistics for design and decision-making under uncertainty.

  • Yifan Hu

    Yifan Hu is a professor of the practice at Khoury College. He researches the interdisciplinary intersections of information visualization, AI, machine learning, and natural language processing, questions to which he brings three decades of industry experience.

  • Tala Talaei Khoei

    Tala Talaei Khoei is an assistant teaching professor at Khoury College. Her research encompasses AI, deep learning, reinforcement learning, inclusivity in computer science, and data quality, visualization, interpretation, mining, and education.

  • David Lazer

    David Lazer is a University Distinguished Professor at Khoury College, jointly appointed with the College of Social Sciences and Humanities. His research investigates misinformation and political communication — especially on social networks — through the lens of computational social science.

  • Mario Nascimento

    Mario Nascimento is a professor of the practice and the director of Pacific Northwest research at Khoury College. His research focuses on data science, specifically spatiotemporal databases.

  • Prashant Pandey

    Prashant Pandey is an assistant professor at Khoury College. He researches scalable data systems with robust theoretical foundations for efficient data management, and tackles every level of that challenge, from the theoretical aspects of data structures to the practical issues of scaling data systems.

  • Predrag Radivojac

    Predrag Radivojac is a professor and associate dean of research at Khoury College. His work strives to grasp the molecular basis for higher-level phenotypes and genetic disorders, and to develop algorithms and analysis techniques related to the function of biological macromolecules, mass spectrometry proteomics, genome interpretation, and precision health.

  • Mirek Riedewald

    Mirek Riedewald is a professor at Khoury College. His research emphasizes the design of novel, scalable data management and analysis techniques, with applications in ornithology, physics, astronomy, and mechanical and aerospace engineering, among other fields.

  • Christoph Riedl

    Christoph Riedl is an associate professor at Khoury College, jointly appointed with the D’Amore-McKim School of Business. His work focuses on optimal team design and management, the impact of social influence and information diffusion on social and economic networks, and the effect those networks have on human collaboration and decision-making.

  • Cheng Tan

    Cheng Tan is an assistant professor at Khoury College. His systems and security research focuses on building verifiable outsourced services and certified neural networks.

  • Olga Vitek

    Olga Vitek is the Raymond Bradford Bradstreet Professor at Khoury College, and the director of the Barnett Institute for Chemical and Biological Analysis. Her lab, which has been recognized with multiple major awards, uses statistical science, machine learning, and large-scale mass spectrometry to understand the functioning of living organisms.

  • Alessandro Vespignani

    Alessandro Vespignani is the Sternberg Distinguished University Professor, and an interdisciplinary appointee between Khoury College and the Bouvé College of Health Sciences. He uses statistical and numerical simulation methods to study the behavior of complex biological, social, and technological networks.

  • Shuo Zhang

    Shuo Zhang is an assistant professor at Khoury College, jointly appointed with the College of Social Sciences and Humanities. Her research examines how labor economics, platform design, algorithmic fairness, and human behavior influence online job markets.