Q&A with Dr. Rupal Patel on Human and Synthetic Voices
Author: Aditi Peyush
Date: 03.31.21
Operating at the intersection of speech and technology, Dr. Rupal Patel is a professor of communication sciences and disorders at Bouvé College of Health Sciences and Khoury College. She is the founder and director of the Communication Analysis and Design Laboratory (CadLab) at Northeastern, and the founder and CEO of VocaliD, a voice AI company. Khoury News sat down with Dr. Patel and asked her about her company, what motivates her, and her recent mention in Scientific American.
What are you up to right now?
Although I am on leave from my faculty role, I am still doing some work with the university, but my primary work right now is off campus with VocaliD. I'm the founder and CEO of VocaliD, a voice technology company.
Can you tell us more about VocaliD?
The company was founded five years ago and aims to leverage the science of speech production for assistive and learning technologies: essentially, how to make speech-based interfaces more natural sounding, adaptive, and personalized. At VocaliD, we create unique-sounding AI voices for individuals with speech impairments, but also for companies that want a unique brand identity for their products that talk.
What kind of work goes on at VocaliD?
Our first focus was on using our technology for people who couldn't speak. Several hundred people are currently using VocaliD's personalized AI voices for their daily communication: people who have lost their voices to conditions like ALS or head and neck cancer, but also children who have had speech disorders from a young age and were once given generic robotic voices. For them, we build a custom voice by finding the right match among the 28,000 global volunteer speakers who have contributed their voices, then combining that donor's speech with whatever the person can still do with their own voice. It's a completely new voice, because we're not just handing over someone else's voice; we're digitally blending a new voice for them.
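To make "digitally blending" a little more concrete: one common open-source way to put one speaker's melody onto another speaker's timbre is vocoder-based analysis and resynthesis. The sketch below uses the WORLD vocoder via the pyworld library; it is only a rough illustration of the general idea, not VocaliD's proprietary pipeline, and the file names are hypothetical placeholders.

```python
# Illustrative sketch only, NOT VocaliD's method: blend one speaker's pitch
# contour (melody) with another speaker's spectral envelope (timbre) using
# the open-source WORLD vocoder (pip install pyworld soundfile numpy).
import numpy as np
import pyworld as pw
import soundfile as sf

def analyze(path):
    """Decompose a recording into pitch (f0), spectral envelope, aperiodicity."""
    x, fs = sf.read(path)
    if x.ndim > 1:              # mix stereo down to mono
        x = x.mean(axis=1)
    f0, sp, ap = pw.wav2world(x.astype(np.float64), fs)
    return f0, sp, ap, fs

# Hypothetical inputs: residual vocalizations from the voice recipient, and
# the best-matched volunteer speaker from the voice bank.
client_f0, _, _, fs = analyze("client.wav")
_, donor_sp, donor_ap, _ = analyze("donor.wav")

# Stretch the client's pitch contour to the donor's frame count so the
# client's melody rides on the donor's vocal-tract characteristics.
# (A real system would align voiced/unvoiced regions far more carefully.)
frames = donor_sp.shape[0]
warped_f0 = np.interp(np.linspace(0.0, 1.0, frames),
                      np.linspace(0.0, 1.0, len(client_f0)), client_f0)

blended = pw.synthesize(warped_f0, donor_sp, donor_ap, fs)
sf.write("blended.wav", blended, fs)
```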
What motivates you in this work?
Making an impact on people's lives. It's really satisfying to hear stories about people using their new voices. I'm also excited about our company's voices being heard in broader market applications. The technology has improved in the five years since we took it out of the lab, so we're now building VocaliD voices for banking companies that want their IVR (interactive voice response) bots to have a unique brand identity. There are also opportunities for voiceover actors from more diverse populations that aren't really represented, like those with mild accents, urban voices, or regional voices. The messaging we hear today is very mid-America white: Siri, Alexa, and Cortana are forty-something white females. For products to speak only to that population says something about the marginalization of other voices. Now we're giving voices to all these populations who haven't been represented before, which I find really fulfilling.
We noticed that Scientific American recently did a story on your research on synthetic voices that included an explainer video. Congratulations! Can you tell us about the significance of their coverage, and what it could mean for your work on voice prostheses?
Thanks – it was a great story, and the video is especially well done. The reporter had interviewed us back in early 2019 when she worked at Quartz, but as that organization got reconfigured, the story fell away. She picked it up again last year at Scientific American and added more historical context and background on the evolution of the technology. It's of course great to be covered in such a high-profile publication, but more importantly, because the reporter also went through the process of recording herself and listening to her own AI voice, she was able to tell the story from a different angle than previous press.
We have been fortunate to have great press from TED, NPR, Wired, WSJ, BBC, and others, and each time there are new groups and audiences who hear about what is possible with all the advances in speech technology today and the ways we can impact the lives of those living with speech impairment. A side effect of coverage in Scientific American is that it may inspire more people to pursue careers that blend an aptitude for STEM with a desire to make significant societal change.
What are the goals you most want to accomplish in your work?
I want to use technology as a superpower. There are so many ways to reduce bias and improve the lives of individuals. Something as small as hearing a variety of different voices in products can start to empower individuals and groups to think of themselves as more central, as part of a bigger conversation. That's really important. We don't think about the micro ways in which we bias products and technology, and I think that how we speak to audiences will be very important in this voice-first world. The way we're consuming information these days is by voice: we're not reading as much anymore, we're listening. Think of healthcare applications. If a bot talks to a patient in a voice that's familiar, it could have a huge impact on the patient's well-being. There are applications in therapy and medication adherence; for example, people with dementia could benefit from hearing the voices of familiar individuals. My goal is that we can use this technology as a superpower for improving people's lives.
Why do you think your work is making a contribution?
For decades, we have focused on speech recognition, which has tons of biases; that has improved over time with more data. Speech synthesis, however, plateaued. Now, as more and more things are starting to talk to us, we have been thinking about engagement and connection. That's partly about the science of what the technology can do, but it's also about the kind of work that researchers trained in the basic sciences, like me, can bring to the table. My research has been on the melody of speech, something called prosody, and I love this word because it encompasses intonation, pitch changes, and loudness changes. So far, many technologies have focused on the content of speech, while the melody has been thought of as icing on the cake. But how you say something makes a difference in how someone perceives it and reacts to it.
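For readers who want a hands-on feel for prosody, the short sketch below extracts the two contours Dr. Patel describes, pitch (intonation) and loudness, from a recording. It uses the open-source librosa library; the file name and frequency bounds are illustrative assumptions, not taken from her research code.

```python
# Illustrative prosody extraction (not Dr. Patel's research code):
# pitch and loudness contours via librosa (pip install librosa numpy).
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)   # hypothetical input file

# Intonation: per-frame fundamental frequency via the pYIN pitch tracker.
# 65-400 Hz is an assumed range covering typical adult speaking pitch.
f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)

# Loudness: per-frame root-mean-square energy.
rms = librosa.feature.rms(y=y)[0]

print(f"median pitch:  {np.nanmedian(f0):.1f} Hz")
print(f"pitch range:   {np.nanmin(f0):.1f}-{np.nanmax(f0):.1f} Hz")
print(f"loudness (RMS) mean/std: {rms.mean():.4f} / {rms.std():.4f}")
```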
What was your pathway into this field?
I didn't know about speech pathology until my third year of college. When I learned about it, I was surprised to discover how much it encompasses, from neuroscience to the importance of communication, and that really appealed to me. That's how I came to start my master's degree in speech pathology. The way I got into prosody was through working with severely disabled individuals whose speech was nonfunctional, so they used devices to talk. What I started noticing, though (this is one of the things I love about science: observation guides inquiry), is that even though they had a device to talk, they would use their voices as much as they could, to vocalize. Initially, those vocalizations didn't mean much to me, but as I got to know these individuals, I learned what they were communicating.
When someone has a speech disorder, you can think of it as a strong accent; they use what they can in their body, manipulating it to produce sound patterns. My PhD was all about finding the signal in this noise; that was how I started my career. As time went on, I wanted to learn how to use that signal, leveraging what computers can do to recognize those patterns. Communication is such a fundamental longing that everyone has, and what drives me is being able to take that natural production of speech and harness it, manipulate it, and craft it in a way that further enhances communication with computers.
How has the pandemic affected your work?
The pandemic got me thinking about how we could leverage our understanding of speech production in new ways. Our voices actually tell a lot of stories about what's going on with us physically and mentally; the voice is an amazing biomarker. Even for people who aren't sick with COVID, many are socially isolated right now, and picking up on subtle vocal cues could serve as an early marker of depression. I am less concerned with diagnosis; I wonder what we can do to improve people's lives. When you talk to people who are energetic, it's contagious, and you start to feel that too. How can we build machines, or interactions between people, that transfer that energy? I'm thinking of speech as an excitation pulse that could end up being as uplifting as other therapies. The pandemic has opened up other ideas about how we can use our knowledge in a better way.