Anyone who has conducted research will likely agree that data collection is one of the biggest hurdles to completing a project on time and on budget. The challenge is especially acute for traditional qualitative studies, where data collection often demands large-scale effort to compile relatively small datasets, and where findings are sometimes dismissed for lacking statistical power.
Now, the snowballing popularity of AI-driven approaches to research has given rise to a new concept for those seeking a lower-friction way to obtain qualitative insights: synthetic qualitative data.
What is “synthetic” data?
You may be wondering if you read that correctly. Does “synthetic” mean manufactured? In short, yes. As in not spoken, written, or otherwise communicated by living, breathing participants? Again, yes. Synthetic qualitative data, one of the latest applications of large language models (LLMs), seeks to supplement, or even supplant, qualitative datasets composed of real responses collected from real people.
The basic process of generating synthetic qualitative data is alluringly simple. First, a researcher feeds a small sample of existing real-world data into an LLM such as ChatGPT or Claude.ai. Then, the model generates new machine-authored outputs that, ideally, mirror the complexity and quality of the original data. The concept may be especially tantalizing in the high-speed field of market research, where large-scale access to consumers’ voices is always needed and timelines are short. Still, it is critical that we pause and test these unknown waters before jumping in headfirst.
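To make the two-step process concrete, here is a minimal Python sketch of the workflow: real seed responses are assembled into a few-shot prompt, and a model call returns new machine-authored responses. The question wording, the prompt text, and the `stub_llm` function are all illustrative assumptions; in practice `call_llm` would wrap a real LLM API, which is deliberately not invoked here.

```python
def build_prompt(seed_responses, n_new):
    """Assemble a few-shot prompt from real seed responses.

    The real data primes the model to imitate the tone, length,
    and vocabulary of actual participants.
    """
    examples = "\n".join(f"- {r}" for r in seed_responses)
    return (
        "Below are real interview responses to the question "
        "'How do you decide between treatment options?':\n"
        f"{examples}\n"
        f"Write {n_new} new responses in the same style and register."
    )

def generate_synthetic(seed_responses, n_new, call_llm):
    """call_llm is any function str -> str wrapping an LLM endpoint.
    It is injected as a parameter so this sketch stays self-contained."""
    prompt = build_prompt(seed_responses, n_new)
    raw = call_llm(prompt)
    # Assume the model returns one synthetic response per line.
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

# A stub "model" so the sketch runs without network access.
def stub_llm(prompt):
    return "- I usually ask my doctor first.\n- Cost matters most to me."

synthetic = generate_synthetic(["I rely on my doctor's advice."], 2, stub_llm)
```

The key design point is that the synthetic outputs are entirely conditioned on the small human sample in the prompt, which is exactly why the question of whether they can add genuinely new information is so important.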
Increasing data quantity
There are cases in which synthetic data can be used ethically and effectively. Early proponents of synthetic qualitative data have already begun to define ground rules and best practices. They tend to emphasize reserving synthetic data for situations where no other viable options exist, such as:
- Insurmountable data privacy concerns
- Difficult-to-find or non-existent target audiences
- Limitations on cost and time
- Need for testing a new methodology or design aspect
However, these guidelines currently inhabit the realm of theory rather than reality, tending to be offered without concrete examples of how they have led to successful outcomes in actual qualitative studies. This, of course, raises the question of whether proposed guidelines will yield results comparable to those from a larger sample of human respondents.
Let’s assume that synthetic responses can be generated to mirror the complexity and quality of the real responses from which they’re derived. How, then, can we determine that they are anything more than an amplification of the smaller human sample’s original distribution of sentiments and perceptions? In other words, can synthetic responses reliably achieve the intended goal of predicting the new sentiments and perceptions that researchers might have heard had they interviewed a larger sample? To answer these questions, several studies have attempted to demonstrate that predominantly synthesized samples consistently lead to the same findings as a fully human sample of the same size. So far, the results have been less than promising.
The point here is not to dismiss synthetic data entirely. Some data is better than no data when cost, time, or other constraints make existing methodologies impossible, though it is always preferable to obtain data from human participants. Until guidelines and best practices are developed and proven in the field, however, the best course of action is to proceed with caution.
Improving data quality
Considering again the early ground rules for ethical and effective use of synthetic data, we at inVibe see potential for synthetic responses in testing aspects of our research designs. For example, our unique methodology requires us to carefully craft survey prompts that elicit detailed, on-topic responses. To amplify the power of our validated question library during question design, we can use a custom LLM to generate synthetic responses to draft questions iteratively, allowing us to predict which versions are likely to elicit the best responses from human respondents. In effect, synthetic responses would let us conduct rapid pre-tests, ensuring that our recommended design elicits the high-quality responses our clients have come to expect.
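The pre-testing idea described above can be sketched as a simple loop: generate synthetic responses for each draft question, score them with a quality heuristic, and keep the best-performing draft. Everything below is a hypothetical illustration, not inVibe’s actual pipeline; the scoring rule (rewarding length and on-topic vocabulary) and the `stub_synthesize` function are stand-ins for a real LLM-backed synthesizer and richer quality criteria.

```python
def response_quality(response, topic_keywords):
    """Crude heuristic: reward detail (word count) and on-topic vocabulary.
    A real pre-test would use far richer criteria."""
    words = response.lower().split()
    on_topic = sum(1 for w in words if w in topic_keywords)
    return len(words) + 5 * on_topic

def pretest_questions(draft_questions, synthesize, topic_keywords):
    """Rank draft questions by the average quality of the synthetic
    responses they elicit. `synthesize` maps a question to a list of
    synthetic responses (LLM-backed in practice, stubbed here)."""
    scores = {}
    for q in draft_questions:
        responses = synthesize(q)
        scores[q] = sum(response_quality(r, topic_keywords)
                        for r in responses) / len(responses)
    best = max(scores, key=scores.get)
    return best, scores

# Stub synthesizer so the sketch runs offline: open-ended phrasing
# yields a detailed response, closed phrasing yields a terse one.
def stub_synthesize(question):
    if "describe" in question.lower():
        return ["I weigh efficacy and side effects before choosing a treatment."]
    return ["Yes."]

best, scores = pretest_questions(
    ["Do you like your treatment?", "Describe how you chose your treatment."],
    stub_synthesize,
    {"treatment", "efficacy", "side"},
)
```

Under this toy heuristic, the open-ended “Describe…” draft wins because its synthetic response is longer and uses more on-topic terms, which mirrors the intuition that open-ended prompts elicit richer answers.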
inVibe’s approach
While new use cases will likely continue to emerge as industries adopt and experiment with synthetic data, inVibe’s one-of-a-kind methodology bypasses the need to analyze synthetic responses at all. Our HIPAA-compliant anonymization and secure data storage procedures eliminate concerns about data privacy. Our panel partnerships connect us to a vast network of stakeholder audiences, making even the rarest audiences accessible. And our automated voice-response platform allows us to conduct frictionless, scalable collection of all-human responses in a cost- and time-efficient manner. So, if you need to talk to real stakeholders but can’t figure out how to reach them in the numbers you require, contact inVibe; we’ll be more than ready to help!