
Hearing and Intuiting Emotion in the Voice

By Christopher Farina, PhD

Tue Oct 08 2024

At inVibe, we use machine learning technology to recognize and measure emotionality in our respondents’ voices when they’re speaking. To humans who have spent a lifetime decoding emotion, this task seems simple. Getting a computer to decode emotions the way a human can, however, is more complicated than it appears. To understand exactly what we’re asking computers to do, let’s explore the nature of emotions, what it means to “recognize” one, and how humans do it.

What are emotions?

When we think about emotions, the first ones that come to mind tend to be “basic” emotions like happiness, sadness, disgust, anger, fear, and surprise. Social scientists refer to these emotions as “basic” because—compared to other emotions—they’re simple. In fact, they’re so one-dimensional and consistent that children tend to recognize them as young as 3 to 4 years old—about 4 years before they reliably recognize more complex emotions like delight (a combination of happiness and surprise) or shame (a combination of fear and disgust)!

How well can we identify emotions?

After a lifetime of interacting with others, we all have intuitions about how people express their emotions. In real life, those intuitions are generated from a variety of sources: facial expressions, body language, tone of voice, and situational context, among others. What if we couldn’t rely on all of these behaviors and contextual elements? What if, like a computer, we had to recognize emotions from just the tone of voice? No context, no body language, no facial expressions, just voice. How would we do it?

Let’s take this example out of the abstract and give it a try ourselves. Below are 5 examples of an actor saying “He told me he was moving away,” each performed with a different basic emotion (happiness, sadness, disgust, anger, or surprise). Listen to the following examples and try to intuit which audio corresponds to each emotion. Once you’re ready to see the answer, click the static below the “EMOTION” header to reveal the corresponding emotion.

[Five interactive audio examples, each with a hidden EMOTION label to reveal]

How’d you do? Maybe this was harder than you expected, or maybe your intuitions matched all 5 correctly. Tasks like this are hard—certainly harder than recognizing the emotions of someone you’re speaking with directly—and most people guess correctly only slightly above random chance (about 1 in 5, or 20%). We’re asking much more from computers, which are expected to be at least 80% accurate, identifying at least 4 out of 5 emotions correctly.

What are we really hearing?

Let’s explore how humans decode emotionality in speech. When the only thing most of us are aware of is that a person “sounds happy,” it’s hard to tease apart how we came to that intuition. Think about your own thought process as you tried to decode the above examples. What were you listening for? What clued you into the emotionality in the audio examples? Well, if you’re like most adults, you probably noticed differences in vocal qualities like loudness (volume), intonation (rise and fall of pitch), pace (speech rate), fluency (smoothness), or any of the 30-odd other human-perceptible vocal qualities. If your intuitions are anything like mine, your descriptions of how these 5 emotions sound may look something like this.

Emotion      Loudness   Intonation   Pace          Fluency
Anger        Louder     Varied       Faster        Choppy
Disgust      Neutral    Falling      Faster        Smooth
Sadness      Quieter    Falling      Slower        Choppy
Surprise     Louder     Rising       Much Faster   Choppy
Happiness    Neutral    Rising       Slower        Smooth

Taken together, subtle differences across these 4 vocal qualities are enough to uniquely identify these 5 basic emotions. But the differences within each feature are just that: “subtle.” Over the course of childhood, we learn to associate certain vocal qualities with particular emotions through repeated exposure in social interactions. And, through this exposure, we learn to adapt to tremendous amounts of variation by speaker and by circumstance. Take loudness, for example (a rough sketch of how qualities like these can be measured in a recording follows the questions below).

  • Is my ‘quieter’ the same as yours? ‘Louder?’ ‘Neutral?’
  • Is my ‘quieter’ right now the same as it was yesterday or as it will be tomorrow?
  • How much louder is a ‘loud’ emotion like surprise than a ‘neutral’ one like happiness?
  • Are these loudness contrasts consistent for speakers of American English and British English? What about speakers of French? Chinese?
  • How loud would a complex emotion like delight (surprise + happiness) be? Does it pick a quality from one emotion, take the average, or do something else altogether?
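
For readers curious how perceptual qualities like these can be turned into numbers at all, here is a minimal sketch using the open-source librosa library. It is purely illustrative and is not inVibe’s pipeline: the proxies it uses (RMS energy for loudness, pitch variability for intonation, silence-based segmentation for pace and fluency) and the silence threshold are simplifying assumptions.

```python
# Illustrative sketch only (not inVibe's pipeline): rough acoustic proxies for
# the four vocal qualities discussed above, computed with librosa and numpy.
import numpy as np
import librosa

def rough_vocal_qualities(path: str) -> dict:
    y, sr = librosa.load(path, sr=None)        # waveform samples and sample rate
    duration = len(y) / sr

    # Loudness proxy: average RMS energy across the clip
    rms = librosa.feature.rms(y=y)[0]
    loudness = float(np.mean(rms))

    # Intonation proxy: variability of the pitch (fundamental frequency) track
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    intonation = float(np.nanstd(f0))          # larger value = more pitch movement

    # Pace and fluency proxies: how much of the clip is speech vs. pause,
    # and how many separate speech chunks it breaks into
    speech = librosa.effects.split(y, top_db=30)   # assumed silence threshold
    speech_time = sum(int(end - start) for start, end in speech) / sr
    pace = speech_time / duration              # fraction of time spent speaking
    fluency = len(speech) / duration           # more chunks per second = choppier

    return {"loudness": loudness, "intonation": intonation,
            "pace": pace, "fluency": fluency}
```

Even a toy measurement like this makes the questions above concrete: the raw numbers it produces mean very little on their own, and only become interpretable once they are compared against a speaker’s own baseline and against many other speakers and contexts.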

What about machine learning?

Whether or not we can articulate answers to these questions, we all have robust intuitions about them in real-life conversations. The problem then becomes: “How do we move beyond intuition? How can we prove our intuitions are accurate?” At inVibe, we solve these problems by using Speech Emotion Recognition (SER) technology to measure emotionality in the voice. In our next blog, I’ll discuss what SER is, how it works, and how we use it to power our linguistic insights.
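
As a small preview of that discussion, and with the clear caveat that the snippet below is not inVibe’s system, publicly available SER models can already be run in a few lines. The Hugging Face transformers library and the specific model and file name used here are illustrative assumptions.

```python
# Illustrative only: an off-the-shelf speech emotion recognition model run
# through the Hugging Face transformers audio-classification pipeline.
# The model name and audio file path are placeholders, not inVibe's SER system.
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="superb/wav2vec2-base-superb-er",   # wav2vec2 fine-tuned for emotion
)

# Returns emotion labels with confidence scores for a short recording
for prediction in classifier("he_told_me_he_was_moving_away.wav"):
    print(f"{prediction['label']}: {prediction['score']:.2f}")
```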

If you’re interested in learning more about inVibe, our listening-centered approach to research, or how we can support your business objectives, contact us to learn more!

Thanks for reading!

Be sure to subscribe to stay up to date on the latest news & research coming from the experts at inVibe Labs.
