At inVibe, we use machine learning technology to recognize and measure emotionality in our respondents’ voices when they’re speaking. To humans who have spent a lifetime decoding emotion, this task seems very simple. However, getting a computer to decode emotions like a human can is a bit more complicated than it seems. To understand exactly what we’re asking computers to do, let’s explore the nature of emotions, what it means to “recognize” one, and how humans do it.
What are emotions?
When we think about emotions, the first ones that come to mind tend to be “basic” emotions like happiness, sadness, disgust, anger, fear, and surprise. Social scientists refer to these emotions as “basic” because—compared to other emotions—they’re simple. In fact, they’re so one-dimensional and consistent that children tend to recognize them as young as 3 to 4 years old—about 4 years before they reliably recognize more complex emotions like delight (a combination of happiness and surprise) or shame (a combination of fear and disgust)!
How well can we identify emotions?
After a lifetime of interacting with others, we all have intuitions about how people express their emotions. In real life, those intuitions are generated from a variety of sources: facial expressions, body language, tone of voice, and situational context, among others. What if we couldn’t rely on all of these behaviors and contextual elements? What if, like a computer, we had to recognize emotions from just the tone of voice? No context, no body language, no facial expressions, just voice. How would we do it?
Let’s take this example out of the abstract and give it a try ourselves. Below are 5 examples of an actor saying “He told me he was moving away,” each performed with a different basic emotion (happiness, sadness, disgust, anger, or surprise). Listen to each example and try to intuit which audio corresponds to which emotion. Once you’re ready to see the answer, click the static below the “EMOTION” header to reveal the corresponding emotion.
How’d you do? Maybe this was harder than you expected; maybe your intuitions matched all 5 correctly. Tasks like this are hard for most people—certainly harder than recognizing the emotions of someone you’re speaking with directly—and most listeners guess only slightly better than random chance (about 1 in 5, or 20%). We’re asking much more from computers, which are expected to be at least 80% accurate, identifying 4 out of 5 emotions correctly.
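If that 20% figure seems abstract, a quick simulation makes the baseline concrete: a listener who guesses one of the five basic emotions at random for each clip lands at roughly one correct answer in five. The snippet below is purely illustrative.

```python
# Sanity check on the "random chance" baseline: guessing one of five
# emotions at random is right about 20% of the time.
import random

EMOTIONS = ["happiness", "sadness", "disgust", "anger", "surprise"]
TRIALS = 100_000
hits = sum(random.choice(EMOTIONS) == true_label
           for true_label in random.choices(EMOTIONS, k=TRIALS))
print(f"Random guessing: {hits / TRIALS:.1%} correct")  # ~20%
```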
What are we really hearing?
Let’s explore how humans decode emotionality in speech. When the only thing most of us are aware of is that a person “sounds happy,” it’s hard to tease apart how we came to that intuition. Think about your own thought process as you tried to decode the examples above. What were you listening for? What clued you into the emotionality in the audio examples? Well, if you’re like most adults, you probably noticed differences in vocal qualities like loudness (volume), intonation (rise and fall of pitch), pace (speech rate), fluency (smoothness), or any of the 30-odd other human-perceptible vocal qualities. If your intuitions are anything like mine, your descriptions of how these 5 emotions sound may look something like this:
| Emotion | Loudness | Intonation | Pace | Fluency |
|---|---|---|---|---|
| Anger | Louder | Varied | Faster | Choppy |
| Disgust | Neutral | Falling | Faster | Smooth |
| Sadness | Quieter | Falling | Slower | Choppy |
| Surprise | Louder | Rising | Much Faster | Choppy |
| Happiness | Neutral | Rising | Slower | Smooth |
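If you’re curious what these qualities might look like to a computer, below is a minimal sketch, using the open-source librosa library, of rough acoustic proxies one could compute for each column in the table above. It is purely illustrative (it is not inVibe’s pipeline), and the specific ranges and thresholds are arbitrary assumptions.

```python
# Illustrative only: crude acoustic proxies for loudness, intonation, pace,
# and fluency, computed with the open-source librosa library. This is not
# inVibe's pipeline; ranges and thresholds below are arbitrary assumptions.
import numpy as np
import librosa

def describe_voice(path):
    y, sr = librosa.load(path, sr=16000)
    duration = len(y) / sr

    # Loudness: average RMS energy of the waveform
    loudness = float(librosa.feature.rms(y=y).mean())

    # Intonation: how much the pitch (F0) rises and falls across the clip
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    intonation = float(np.nanstd(f0))

    # Pace: acoustic onsets per second, a rough stand-in for speech rate
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    pace = len(onsets) / duration

    # Fluency: how many separate bursts of sound the clip breaks into
    # (more bursts, separated by silence, reads as "choppier" delivery)
    bursts = librosa.effects.split(y, top_db=30)
    fluency = len(bursts)

    return {"loudness": loudness, "intonation": intonation,
            "pace": pace, "fluency": fluency}

# Example usage (hypothetical file name):
# print(describe_voice("he_told_me_he_was_moving_away_anger.wav"))
```

Comparing numbers like these across the 5 recordings of the same sentence is, roughly speaking, what the listening exercise above asks your ears to do.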
Taken together, subtle differences across these 4 vocal qualities are enough to uniquely identify these 5 basic emotions. But the differences within each feature are just that: “subtle.” Over the course of childhood, we learn to associate certain vocal qualities with particular emotions through repeated exposure in social interactions. And, through this exposure, we learn to adapt to tremendous amounts of variation by speaker and by circumstance. Take loudness, for example (one common way of handling this kind of variation is sketched after the questions below).
- Is my ‘quieter’ the same as yours? ‘Louder?’ ‘Neutral?’
- Is my ‘quieter’ right now the same as it was yesterday or as it will be tomorrow?
- How much louder is a ‘loud’ emotion like surprise than a ‘neutral’ one like happiness?
- Are these loudness contrasts consistent for speakers of American English and British English? What about speakers of French? Chinese?
- How loud would a complex emotion like delight (surprise + happiness) be? Does it pick a quality from one emotion, take the average, or do something else altogether?
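Questions like these are exactly why raw measurements are hard to compare across people. One common way to handle it (a generic technique, not necessarily what inVibe does) is to express each quality relative to the speaker’s own baseline, as in the sketch below.

```python
# A generic illustration (not inVibe's method) of making features
# speaker-relative: compare each measurement to that speaker's own baseline
# rather than to an absolute scale, so "quieter" means quieter *for them*.
import numpy as np

def speaker_relative(feature_values):
    """Z-score a speaker's feature values against their own mean and spread."""
    values = np.asarray(feature_values, dtype=float)
    return (values - values.mean()) / (values.std() + 1e-9)

# Two speakers whose absolute loudness differs, but whose *pattern* is similar:
quiet_speaker = [0.02, 0.03, 0.02, 0.06]   # RMS loudness per utterance
loud_speaker  = [0.20, 0.30, 0.20, 0.60]
print(speaker_relative(quiet_speaker))
print(speaker_relative(loud_speaker))      # nearly identical once normalized
```

Once features are speaker-relative, “quieter than usual for this person” becomes a meaningful signal even when two speakers’ absolute volumes differ wildly.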
What about machine learning?
Whether or not we can articulate answers to these questions, we all have robust intuitions about them in real-life conversations. The problem then becomes: “How do we move beyond intuition? How can we prove our intuitions are accurate?” At inVibe, we solve these problems by using Speech Emotion Recognition (SER) technology to measure emotionality in the voice. In our next blog, I’ll discuss what SER is, how it works, and how we use it to power our linguistic insights.
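At its core, an SER system is a classifier that maps acoustic features (like the proxies sketched earlier) to emotion labels. The toy example below shows only the general shape of that idea; the synthetic data and off-the-shelf model are placeholders, not inVibe’s system.

```python
# Toy illustration of the general shape of a speech-emotion classifier:
# acoustic features in, emotion label out. Synthetic data and an off-the-shelf
# model are used as placeholders; this is not inVibe's SER system.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

EMOTIONS = ["anger", "disgust", "sadness", "surprise", "happiness"]
rng = np.random.default_rng(0)

# Pretend training set: 200 recordings, each described by four features
# (loudness, intonation, pace, fluency) and a human-annotated emotion label.
X_train = rng.normal(size=(200, 4))
y_train = rng.choice(EMOTIONS, size=200)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predict the emotion of a new recording from its features.
new_clip_features = rng.normal(size=(1, 4))
print(model.predict(new_clip_features)[0])
```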
If you’re interested in learning more about inVibe, our listening-centered approach to research, or how we can support your business objectives, contact us to learn more!