Associate Professor
Northeastern University
Khoury College of Computer Sciences, 202 West Village H
360 Huntington Avenue
Boston, MA 02115
+1-617-373-4766
Cloud computing, distributed big-data management and analysis, data-stream processing, data-driven science
Research collaborations: I have been collaborating with industrial partners and with scientists from various disciplines since 1999. While the specific challenges vary, there is always a common theme: everybody is collecting and generating an ever-increasing amount of data. In this world of big data and of data-driven science, groundbreaking discoveries depend on the ability to efficiently analyze and process these massive amounts of data. We have been designing scalable data management and analysis techniques for neuroscience, discovery and linking of personal information (e.g., as mandated by GDPR), ornithology, ecology, rocket science (really!), astronomy, and high-energy physics---to name a few.
Research vision: Create algorithms that scale in the size and complexity of data, with a focus on analysis problems motivated by grand challenges in Open Data and data-driven science.
What our PhD students do: design novel algorithms; prove lower bounds, upper bounds, and optimality; build big-data systems; publish results in the premier CS and domain-science venues.
Prof. Riedewald is co-founder and co-leader of the DATA Lab @ Northeastern. Currently he focuses on the development of novel techniques for large-scale distributed data analysis, data management, and data mining. His research agenda is driven by collaborations with domain scientists and industry, with the goal of producing results that are publishable both in premier computer science venues and in the premier venues of the application domains.
How do we effectively and efficiently use many machines in a cluster or in a cloud to solve a big-data-analysis challenge? What is the best way to partition a dataset so that the running time of the distributed computation is minimized? How do we abstract a complex distributed computation so that we can learn a mathematical model of how its running time depends on the parameters affecting data partitioning?
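To make the partitioning question concrete, here is a minimal sketch in Python (hypothetical toy code, not one of the DATA Lab's systems): it hash-partitions a skewed dataset across a varying number of workers and reports the size of the largest partition, a simple proxy for the running time of the distributed computation, since the slowest worker determines when the job finishes.

```python
# Toy illustration: how the choice of partitioning interacts with data skew.
from collections import Counter

def max_partition_load(records, num_workers):
    """Assign each (key, value) record to a worker by hashing its key and
    return the size of the largest partition (the likely straggler)."""
    loads = Counter(hash(key) % num_workers for key, _ in records)
    return max(loads.values())

# Skewed toy dataset: one "hot" key dominates, so adding workers barely helps.
records = [("hot", 1)] * 1000 + [(str(i), i) for i in range(200)]
for w in (2, 4, 8, 16):
    print(w, "workers -> max partition size:", max_partition_load(records, w))
```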
When a query on big data produces huge output, can we quickly return the "most important" results without even computing the entire output? If the notion of importance is difficult to define, can we return the top-ranked results so quickly that the user can try out different options (nearly) interactively? For what types of queries and data can this functionality be supported? And what are the best time and space guarantees we can provide?
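To convey the flavor of returning top-ranked results without computing the entire output, the sketch below (illustrative only, not the lab's actual algorithm) lazily enumerates combinations of two small relations in increasing order of a summed cost using a priority queue, so the first few answers appear without materializing the full result.

```python
# Toy ranked enumeration over the combinations of two cost-sorted lists.
import heapq

def ranked_join(left, right):
    """left, right: lists of (cost, payload) sorted by cost.
    Yields combined results in non-decreasing total cost."""
    if not left or not right:
        return
    heap = [(left[0][0] + right[0][0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        total, i, j = heapq.heappop(heap)
        yield total, left[i][1], right[j][1]
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (left[ni][0] + right[nj][0], ni, nj))

left = [(1, "a1"), (3, "a2"), (7, "a3")]
right = [(2, "b1"), (2, "b2"), (10, "b3")]
for _, result in zip(range(3), ranked_join(left, right)):
    print(result)  # only the top-ranked answers are ever computed
```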
We are interested in exploring and correcting the impact AI tools have on our everyday lives, especially in the context of algorithmic fairness for ranking. Individuals and institutions have long used rankings for decision-making, e.g., to determine who gets a job, who is admitted to a university, which university to apply to, and even to pick the best basketball players of all time. While it is convenient to rely on data and algorithms to produce such rankings, we must establish guardrails to prevent unintended outcomes. In this project, we investigate acceptable notions and measures of fairness, and devise mechanisms for ensuring that algorithms behave accordingly. One result is a novel fairness definition based on the qualifications of individual entities. It complements previous work, which explored target ranges for the representation of groups in the top positions of a ranking. We also study techniques for explaining and debugging an undesirable ranking and the function used to assign scores to entities. Rather than relying on complex AI approaches whose rankings are difficult to understand, we focus on linear scoring functions and attempt to use powerful formal methods in a way that allows our approach to scale to big data.
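As a toy illustration of the ingredients mentioned above, the sketch below (hypothetical data, weights, and constraint; not one of the fairness measures from our papers) ranks candidates with a linear scoring function and checks a simple group-representation condition on the top-k positions.

```python
# Toy ranking with a linear scoring function plus a representation check.
def rank(candidates, weights):
    """candidates: list of (name, group, feature_vector)."""
    score = lambda c: sum(w * x for w, x in zip(weights, c[2]))
    return sorted(candidates, key=score, reverse=True)

def satisfies_representation(ranking, group, k, min_count):
    """Does the given group appear at least min_count times in the top-k?"""
    return sum(1 for _, g, _ in ranking[:k] if g == group) >= min_count

candidates = [
    ("u1", "A", (0.9, 0.4)), ("u2", "B", (0.7, 0.9)),
    ("u3", "A", (0.5, 0.6)), ("u4", "B", (0.8, 0.2)),
]
ranking = rank(candidates, weights=(0.6, 0.4))
print([name for name, _, _ in ranking])
print("group B represented in top-2:",
      satisfies_representation(ranking, "B", 2, 1))
```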
Fueled by advances in information extraction and societal trends that value institutional openness and transparency, structured data are being produced and shared at an overwhelming speed. Open-data sharing is central to supporting institutional transparency, but transparency is not achieved if shared data cannot be found and effectively aligned with other data being studied by data scientists, journalists, and others. This project will fundamentally contribute to the new science of open-data sharing by laying the theoretical foundations of data discovery and by designing a system that solves the problem at scale.
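One basic building block of data discovery is estimating whether a column in one table can be joined with a column in another. The sketch below (illustrative toy code, not the project's system) measures set containment between the distinct values of two columns as a simple joinability signal.

```python
# Toy joinability test between two table columns via set containment.
def containment(col_a, col_b):
    """Fraction of distinct values in col_a that also appear in col_b."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a) if a else 0.0

my_cities = ["Boston", "Cambridge", "Somerville", "Boston"]
open_data_cities = ["Boston", "Somerville", "Worcester", "Springfield"]
print(containment(my_cities, open_data_cities))  # ~0.67 -> likely joinable
```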
How do we turn 20,000 3D image stacks (10 terabytes per mouse brain) taken by a high-resolution light microscope into a coherent 3D image of the brain? How do we extract from this massive dataset a graph representing the neurons captured in the image? And how do we analyze this graph efficiently? Can we extend this approach to include other brain data, e.g., from fMRI and electron microscopes? And can we generalize our techniques to graph problems in other domains such as social network analysis?
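As a toy illustration of the last step, the sketch below (hypothetical data, not the actual pipeline) treats traced neuron segments as graph nodes and detected contacts as edges, and computes connected components with a plain breadth-first search.

```python
# Toy graph analysis over extracted neuron segments.
from collections import defaultdict, deque

def connected_components(edges):
    """edges: list of (segment, segment) pairs; returns sets of connected nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nxt in adj[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(comp)
    return components

edges = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
print(connected_components(edges))  # two separate structures
```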
Scolopax: Making Analysis of Scientific Data Fast and Easy
Cayuga: A Scalable System for Data Stream Processing
Additive Groves Prediction Technique and Automatic Interaction Detection