
Responsible AI: Validating LLM Outputs

By Christopher Farina, PhD

Thu Dec 19 2024

Early Adopters

At inVibe, we’ve been working with AI since our founding in 2013. We were among the first market research agencies to adopt large language models (LLMs) back in 2021, and we released our LLM tools to clients in 2023. AI isn’t a gimmick or a buzzword for us; it’s central to our founding mission: to develop technology that provides cost-effective, high-speed healthcare market research while maintaining the highest quality.

As AI takes center stage in healthcare following the release of popular LLMs like ChatGPT and Claude, we’ve continued to integrate the latest AI models, features, and functionality into more of the work we do at inVibe. Working with AI every day, we quickly began to see the cracks, and we’ve prioritized finding ways to ensure the accuracy and integrity of AI-generated content.

Responsible Innovators

To ensure our clients have reliable AI insights, we designed and implemented an evaluation system to validate our AI tools. Our language experts and prompt engineers leverage this system to iteratively test the quality and reliability of the outputs. By testing different prompts and LLMs in this system, we’re able to ensure that our AI tools meet our clients’ needs and our high standards.

Practically speaking, we do this by having two trained language experts compare two outputs presented side by side on a single screen. Once we’ve reviewed both outputs, we score each across six categories based on best practices for evaluating the quality of language data and conversational LLMs (a code sketch of this rubric follows the list):

  • Completeness: To what extent does the output address everything it was asked to?
  • Accuracy: How closely does the output align with our understanding of the voice data? (Rating: 1-4)
  • Integrity: How clear is the relationship between the findings and the verbatim evidence in the output? (Rating: 1-4)
  • Organization: To what extent is the output presented coherently and in the prescribed way? (Rating: 1-4)
  • Formatting: Does the output include consistent formatting that enhances readability? (Rating: 1-2)
  • Citation: Are findings in the output supported by citations in the prescribed format? (Rating: 1-2)
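
To make the rubric concrete, here is a minimal sketch of how these categories and scales could be represented in code. Everything in it (the `RubricScore` class, the field names, and the assumption that Completeness shares the 1-4 scale of the other qualitative categories) is our illustration, not inVibe’s actual implementation.

```python
from dataclasses import dataclass, fields

# Maximum score per category, mirroring the scales listed above.
# Completeness is assumed to share the 1-4 scale of the other
# qualitative categories; all names here are illustrative only.
SCALES = {
    "completeness": 4,
    "accuracy": 4,
    "integrity": 4,
    "organization": 4,
    "formatting": 2,
    "citation": 2,
}

@dataclass
class RubricScore:
    """One language expert's scores for a single LLM output."""
    completeness: int
    accuracy: int
    integrity: int
    organization: int
    formatting: int
    citation: int

    def __post_init__(self):
        # Reject any score outside its category's scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= SCALES[f.name]:
                raise ValueError(f"{f.name} must be between 1 and {SCALES[f.name]}")

    @property
    def total(self) -> int:
        """Sum across all six categories."""
        return sum(getattr(self, f.name) for f in fields(self))
```

Under this sketch, one expert’s judgment of one output might look like `RubricScore(completeness=4, accuracy=3, integrity=4, organization=3, formatting=2, citation=2)`, with a `.total` of 18.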

Let’s take a look at one of these comparisons in action. In the image below, we see a quality test of two different LLMs that were prompted to report the key findings from the same set of voice data. After reviewing both outputs, the language expert scrolls down to assign scores to each output and select a ‘winner.’

Once we’ve collected a sufficient number of ratings for the task we’re assessing, we compare the total scores, sub-scores, and selected winners to determine an overall winner (i.e., the model that performed better). This overall winner becomes our new baseline for the task in the next round of testing and the default way our AI tools perform this operation in our dashboard, at least until the next upgrade.
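
As a sketch of that aggregation step (again, an assumption about the shape of the data, not inVibe’s actual pipeline), one could tally expert picks and rubric totals across all rated comparisons and break ties on total score:

```python
from collections import Counter

def pick_overall_winner(ratings):
    """ratings: list of (score_a, score_b, winner) tuples, where the
    scores are RubricScore objects (see the sketch above) and winner
    is 'A' or 'B' as selected by the language expert."""
    totals = {"A": 0, "B": 0}
    picks = Counter()
    for score_a, score_b, winner in ratings:
        totals["A"] += score_a.total
        totals["B"] += score_b.total
        picks[winner] += 1
    # Prefer the model experts picked more often; break ties on total score.
    return max(("A", "B"), key=lambda model: (picks[model], totals[model]))
```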

Through many iterations of this testing, we’ve improved the quality of our models by 46.73% over the last 12 months (and we expect to continue this pace in the next 12 months)! This means that our clients can rest assured that our AI tools will remain best-in-class: offering comprehensive, accurate, transparent, and comprehensible outputs tuned to market research best practices for qualitative voice data.

Service Partners

Are you interested in learning more about how our AI tools can make your qualitative research simpler, more systematic, and more scalable? Schedule a demo with us today and see for yourself!

Thanks for reading!

Be sure to subscribe to stay up to date on the latest news & research coming from the experts at inVibe Labs.
