
Measuring and Identifying Emotion in the Voice

By Christopher Farina, PhD

Tue Oct 15 2024

In our last post, I discussed what emotions are and how we can identify emotions by recognizing subtle differences in vocal qualities. I’ll be focusing here on what Speech Emotion Recognition (SER) is, how computers can reliably measure emotions, and how we use SER to power our linguistic insights.

What is SER?

Speech emotion recognition (SER) is an application of AI that uses machine learning models to measure and classify emotions as conveyed in the voice itself, not in the words spoken. Our SER model takes into account vocal qualities that humans can easily distinguish (like pitch, tone, speed, and loudness) and others that anyone would struggle to articulate but may be able to intuit with some training (like signal energy, voice quality, voice onset, and vowel formant frequencies). These vocal qualities then get laddered up to three acoustic measures that have been shown to be more robust, more adaptable, and more accurate than measuring emotions directly: activation, valence, and dominance.

  • Activation: The level of intensity or excitement in the voice. Increased activation indicates greater enthusiasm for a topic
  • Valence: The level of positivity or negativity in the voice. Increased valence indicates more pleasant feelings toward a topic
  • Dominance: The control and assuredness communicated by the voice. Increased dominance indicates a more confident way of speaking about a topic

Taken together, these three measures let us quantify the underlying emotions encoded in the voice and compare differences both in the emotions themselves and in their intensity.
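To make these ideas concrete, here is a minimal, illustrative sketch of an SER-style pipeline in Python: it extracts a few acoustic descriptors from a recording and maps them to activation, valence, and dominance with a placeholder regressor. The libraries (librosa, scikit-learn), feature choices, and model are assumptions for illustration only, not a description of our production system.

```python
# Illustrative SER-style sketch: extract acoustic descriptors, map them to
# activation/valence/dominance with a placeholder model. Not inVibe's pipeline.
import numpy as np
import librosa
from sklearn.linear_model import Ridge

def acoustic_features(path: str) -> np.ndarray:
    """Summarize a recording as a fixed-length acoustic feature vector."""
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # rough pitch contour
    rms = librosa.feature.rms(y=y)[0]                     # loudness / signal energy
    zcr = librosa.feature.zero_crossing_rate(y)[0]        # crude voice-quality proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # spectral shape
    # Collapse each contour to mean and standard deviation, a stand-in for the
    # richer statistical contour mapping described later in this post.
    parts = [f0, rms, zcr] + list(mfcc)                   # 16 contours in total
    return np.array([s for p in parts for s in (np.nanmean(p), np.nanstd(p))])

# Placeholder model: in practice this would be trained on a large labeled corpus.
# Targets are [activation, valence, dominance] on some normalized scale.
model = Ridge()
X_train = np.random.randn(100, 32)   # hypothetical training features (16 contours x 2 stats)
y_train = np.random.randn(100, 3)    # hypothetical activation/valence/dominance labels
model.fit(X_train, y_train)

# Example usage (hypothetical file):
# activation, valence, dominance = model.predict(
#     acoustic_features("response.wav").reshape(1, -1))[0]
```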

Accounting for natural variation

To capture emotionality accurately across cultures, our model has been trained on large and diverse datasets that span a wide range of emotional expressions from tens of thousands of speakers across dozens of languages and cultures. This rich training data allows our model to perform in a way that is ‘speaker-independent,’ focusing on universal aspects of emotional expression instead of those specific to any individual speaker or speech community. More specifically, our model measures about 40 acoustic features, maps their contours statistically, and outputs normalized scores to ensure direct comparability across speakers.
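As a rough illustration of that normalization step, the sketch below z-scores a speaker's raw activation, valence, and dominance outputs against corpus-level statistics so that scores from different speakers land on a comparable scale. The function name and the numbers are hypothetical.

```python
# Minimal sketch of score normalization: z-score raw outputs against
# corpus-level statistics so scores are comparable across speakers.
import numpy as np

def normalize_scores(raw_scores: np.ndarray,
                     corpus_mean: np.ndarray,
                     corpus_std: np.ndarray) -> np.ndarray:
    """Convert raw model outputs to z-scores relative to a large, diverse corpus."""
    return (raw_scores - corpus_mean) / corpus_std

# Hypothetical corpus statistics (activation, valence, dominance).
corpus_mean = np.array([0.42, 0.05, 0.31])
corpus_std = np.array([0.18, 0.22, 0.15])

speaker_raw = np.array([0.60, -0.10, 0.40])
print(normalize_scores(speaker_raw, corpus_mean, corpus_std))
```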

To ensure that each speaker’s emotionality is measured reliably, we also score each respondent relative to a baseline that we collect from them at the beginning of the voice survey. Relativizing scores like this ensures that any emotionality we measure in the voice corresponds exclusively to the topic that the respondent is discussing. That is, if someone who is already frustrated when they start our survey is relatively happy about a topic they’re reacting to, the model alone would still output somewhat frustrated, because the vocal qualities of frustration remain in their voice, coloring their emotional measures. Relativizing the scores effectively resets their emotional baseline so that the model recognizes their emotionality about that topic as happy.
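In code, the relativization itself is simple arithmetic: subtract the respondent's own baseline scores from the scores measured while they discuss a topic. The sketch below uses made-up numbers for the frustrated-but-happy-about-the-topic scenario described above.

```python
# Baseline relativization sketch: scores are [activation, valence, dominance].
# All numbers are invented for illustration.
import numpy as np

baseline = np.array([0.7, -0.8, 0.3])        # frustrated at survey start: high activation, low valence
topic_response = np.array([0.5, -0.3, 0.4])  # absolute scores while discussing the topic

relative = topic_response - baseline         # [-0.2, 0.5, 0.1]
# The raw valence still looks negative, but relative to this respondent's own
# baseline it has shifted clearly positive, so the topic reads as pleasant.
print(relative)
```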

Measuring emotion in practice

Let’s turn now to an example where we uncover the underlying emotions in physician responses. In this example, we asked physicians to share their reactions to clinical trial data released at a professional oncology conference. I know it doesn’t sound like the kind of thing that one would get overly emotional about, but let’s take a listen and then a look.

First, let’s listen to two typical responses from oncologists reacting to two different abstracts. The first HCP (pink) is reacting to Abstract 1, which covers the results of a phase 1 study of a novel molecule. The second HCP (green) is reacting to Abstract 2, which covers phase 2 data from a novel combination involving two familiar molecules.

Both responses cover many of the same topics: “interest” in the results, concerns about sample size and generalizability, how this data compares to standard of care, and potential adoption into practice. However, I’m sure you heard a subtle difference between the two HCPs. With acoustic analysis, some of our intuitions about these differences in the HCPs’ voices can be plotted and shown plainly.

These two charts (called circumplexes) illustrate the different levels of enthusiasm (activation; y-axis) and positivity (valence; x-axis) of each respondent’s voice as they react to Abstract 1 (left) and Abstract 2 (right). Comparing these two charts, we see that responses to Abstract 1 cluster in the lower left quadrants and responses to Abstract 2 in the upper right quadrants. These differences in clustering indicate that HCPs are more enthusiastic about Abstract 2 than they are about Abstract 1. In this case, enthusiasm for a treatment indicates a greater willingness to adopt it once it is approved by the FDA.
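For readers who want to picture how such a circumplex is drawn, here is a small sketch that plots valence on the x-axis and activation on the y-axis for two invented sets of responses, one clustering lower left and one upper right. The data points are fabricated for illustration and are not the HCP responses described above.

```python
# Circumplex-style plot sketch: valence (x) vs. activation (y), one point per response.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
abstract_1 = rng.normal(loc=[-0.4, -0.3], scale=0.15, size=(12, 2))  # clusters lower left
abstract_2 = rng.normal(loc=[0.4, 0.35], scale=0.15, size=(12, 2))   # clusters upper right

fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
for ax, data, title in zip(axes, (abstract_1, abstract_2), ("Abstract 1", "Abstract 2")):
    ax.scatter(data[:, 0], data[:, 1])
    ax.axhline(0, linewidth=0.5)
    ax.axvline(0, linewidth=0.5)
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_xlabel("valence (negative -> positive)")
    ax.set_title(title)
axes[0].set_ylabel("activation (calm -> excited)")
plt.tight_layout()
plt.show()
```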

Measuring emotion, understanding behavior

Acoustic analysis is a potent tool in our language experts’ tool belts, one that helps our clients understand the thoughts, beliefs, and attitudes that explain and motivate behavior. If you’re interested in learning more about how inVibe can help you collect and analyze the voices of your key stakeholders (patients, caregivers, HCPs, payors, and more!), contact a member of our Strategy team today.

Thanks for reading!

Be sure to subscribe to stay up to date on the latest news & research coming from the experts at inVibe Labs.
