Since the launch of OpenAI’s ChatGPT, we’ve become obsessed with Large Language Models (LLMs), agents, and productivity. But in the race for the next GenAI platform, we are missing the most critical ingredient: the data that feeds and trains these models and systems.
As AI reshapes industries and decision-making across healthcare, law, and finance, the quality, diversity, and ethical foundation of its data inputs are now under the microscope. The central question is whether GenAI will reinforce existing inequalities or create a more inclusive future.
Interestingly, market research has an opportunity to shape the conversation when it comes to addressing GenAI bias. As a research practitioner of over 25 years, I know market research is not just about understanding consumer preferences – it’s about capturing the full spectrum of human experience and ensuring that the data we create reflects all parts of society.
The foundation problem: why AI bias starts with data collection
As we know, every AI system is only as good as the data it’s trained on. If that data is biased, incomplete, or unrepresentative, the resulting AI-generated output will amplify those flaws at scale. This isn’t a future concern – it’s happening now across industries where AI systems and LLMs consistently underperform for women, ethnic minorities, and other underrepresented groups. Yet it is an issue few are even aware of, let alone understand.
The digital divide runs deeper than most realise. Currently, less than 1% of AI training data comes from Africa, despite the continent representing nearly 17% of the global population. This stark imbalance means AI models are fundamentally built on a foundation that excludes the majority of the world’s experiences, perspectives, and cultural contexts.
When researchers tested LLMs in 2024, they found these systems favoured white-associated names 85% of the time and never prioritised Black male names over white alternatives. Meanwhile, AI image generators produce male professionals in 75-100% of cases when prompted with neutral job titles. Just ask ChatGPT to provide a visual representation of a CEO and see what it comes up with.
The challenge runs deeper than just prompts and technical limitations. AI systems lack the intuitive understanding that guides human decision-making. As research commentator Alison O’Connell notes, “AI hasn’t started off as a baby where someone else had to take care of it and it had to learn to crawl before it could walk.” This observation underscores a critical point: AI lacks the lived experiences, cultural context, and social conditioning that shape human understanding and nuance.
This representation crisis creates real-world harm that’s already hitting company bottom lines. One global company’s AI hiring system, trained on historically male-dominated data, learned to discriminate against female candidates, demonstrating how biased training data can perpetuate workplace inequalities. Healthcare algorithms have systematically favoured white patients over Black patients in critical care decisions.
As more businesses and marketers rely on GenAI to produce content, advertising, and new product ideas, it becomes a race to the middle, with little room for innovation or outliers.
Market research has already solved this problem
Here’s the thing: market research figured out how to avoid these exact pitfalls decades ago. The industry learned the hard way that convenience sampling – grabbing whoever’s easiest to reach – produces inaccurate, generic insights. So we developed something better: systematic, nationally representative approaches that actually reflect the people you’re trying to understand and produce unique insights.
Quantitative research teaches us that representative samples aren’t optional – they’re fundamental. Unlike AI’s current approach of scraping whatever data happens to be online, quantitative researchers carefully structure their samples to mirror population demographics. They use stratified sampling, dividing populations into relevant subgroups before collecting data from each segment proportionally.
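To make that concrete, here is a minimal sketch of proportional stratified sampling, the kind of step AI teams could borrow when assembling training corpora. The strata names and target shares are illustrative assumptions, not figures from any real study:

```python
import random
from collections import defaultdict

# Illustrative target shares (e.g. mirroring census figures) – assumed for
# demonstration, not real population data.
TARGET_SHARES = {"region_a": 0.40, "region_b": 0.35, "region_c": 0.25}

def stratified_sample(records, stratum_key, target_shares, total_n, seed=42):
    """Draw a sample whose strata match the target shares.

    records: list of dicts, each carrying a `stratum_key` field.
    Raises ValueError when a stratum lacks enough records – the same
    "representation gap" a training-data audit should surface.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[stratum_key]].append(record)

    sample = []
    for stratum, share in target_shares.items():
        quota = round(total_n * share)
        pool = by_stratum.get(stratum, [])
        if len(pool) < quota:
            raise ValueError(
                f"Stratum '{stratum}' has {len(pool)} records but needs "
                f"{quota}: collect more before training."
            )
        sample.extend(rng.sample(pool, quota))
    return sample
```

Calling stratified_sample(records, "region", TARGET_SHARES, total_n=1000) would return a thousand-record sample whose regional mix matches the targets – or fail loudly where the data falls short, rather than silently over-weighting whichever group was easiest to collect.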
Think about it this way: if you’re launching a product globally, you wouldn’t survey only college students in California and assume their preferences represent everyone. Yet that’s essentially what we’re doing with AI training data. About 80% of current AI training data comes from WEIRD populations (Western, Educated, Industrialized, Rich, Democratic), even though they represent less than 15% of the world’s population. “When AI data comes from such a limited demographic, the resulting systems risk being inaccurate and biased against the majority of people whose experiences and cultures aren’t represented in that data.”
Qualitative research captures the nuances that quantitative data misses. When market researchers run focus groups or in-depth interviews, they’re not just collecting opinions – they’re understanding the cultural context, emotional responses, and unspoken assumptions that shape how different communities interpret the world. Recent research comparing AI-generated insights to human-conducted interviews revealed a critical gap: “relying solely on AI would miss biases, cultural nuances, and the depth of human understanding,” said Australian qualitative researcher Emma Lane.
Consider how content moderation systems shaped by Western perspectives frequently misclassify culturally appropriate non-Western content as harmful. AI models struggle with code-mixing, sarcasm, and culturally specific language use that human researchers navigate naturally. Has anyone noticed how grammatically correct, polite, and well-spoken everyone suddenly sounds in emails and LinkedIn posts? We are all starting to sound the same, stripped of the personalisation that makes humans unique.
Three proven methods AI desperately needs
Market research offers three specific methodologies that could help reduce the bias inherent in GenAI and create more diverse and inclusive outputs:
Representative sampling means curating training data to reflect real-world diversity rather than digital convenience. If 30% of your global users speak a particular language, roughly 30% of your training data should reflect that language and culture. This isn’t rocket science – it’s basic research methodology that ensures no single demographic dominates your dataset.
Diverse participant recruitment addresses the human feedback problem. Instead of training AI purely on automated processes, market research shows us how to gather input from underrepresented communities. This means beta-testing AI responses with users from different cultures and demographics, using techniques like in-language moderation and cross-cultural collaboration to ensure every voice is genuinely heard and understood. It is neither an easy nor a convenient way to train, but it will become increasingly important to closing the bias gap.
Segmentation thinking ensures comprehensive coverage by explicitly identifying and addressing gaps. Just as insights professionals and marketers ensure they research each critical customer segment, AI developers should audit their training data by demographic and actively fill the gaps where groups are underrepresented – a simple version of such an audit is sketched below. If certain ethnicities, genders, or cultures are missing, that’s not an oversight – it’s a fundamental design flaw.
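As a rough illustration, the sketch below runs that kind of audit: it compares a dataset’s demographic mix against reference shares and flags any group that falls short. The group names, shares, and tolerance threshold are assumptions for demonstration, not a standard tool:

```python
from collections import Counter

# Illustrative reference shares – assumed for demonstration, not census data.
REFERENCE_SHARES = {"group_a": 0.50, "group_b": 0.30, "group_c": 0.20}

def audit_representation(records, key, reference_shares, tolerance=0.05):
    """Flag demographic groups that fall short of their reference share.

    Returns (group, actual_share, target_share) for every group that is
    missing from the data or underrepresented by more than `tolerance`.
    """
    counts = Counter(record.get(key, "unknown") for record in records)
    total = sum(counts.values())
    gaps = []
    for group, target in reference_shares.items():
        actual = counts[group] / total if total else 0.0
        if actual < target - tolerance:
            gaps.append((group, actual, target))
    return gaps
```

Run before each training cycle, an audit like this turns “we think the data is diverse” into a measurable checklist: every flagged group becomes a concrete data-collection task rather than an invisible omission.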
The business case is overwhelming
This isn’t just about doing the right thing – it’s about building better, more profitable AI systems. McKinsey research shows that organisations with diverse leadership are 25% more likely to achieve above-average profitability, while those with ethnic and cultural diversity outperform peers by 36%. If it works in the business world, why not in our LLMs?
The market opportunity is massive. The global disability market represents $6 trillion annually. Voice interfaces such as Siri, now mainstream, were born out of solutions for people with mobility issues. In the US, Black and Latino consumers each represent over $1 trillion in spending power and are an emerging growth market for the market research industry. Yet these communities continue to be underserved by AI systems trained on biased datasets.
Meanwhile, legal settlements are escalating rapidly. Two landmark settlements in 2023–2024 highlight the growing legal scrutiny of AI-driven discrimination in the US. In one, a company settled charges under the Age Discrimination in Employment Act after its system automatically rejected female applicants aged 55 and over and male applicants aged 60 and over; in the other, a company was penalised for discriminating against Black and Hispanic renters. Companies that get this right early avoid these costs while accessing underserved markets.
More importantly, inclusive AI simply works better. When your training data reflects the full spectrum of human experience, your AI becomes more culturally competent, globally relevant, and trusted by diverse user bases. Users quickly detect when they’re not represented – and they disengage just as quickly when they feel excluded.
Making it happen
The technical implementation isn’t the hard part. AI developers need to start treating data collection with the same rigour that market researchers bring to major studies. This means curating diverse training datasets, balancing data sources, consulting cultural experts from different communities, and continuously testing outputs with diverse user groups.
Encouragingly, we’re already seeing developing countries take matters into their own hands. The Global Digital Compact, launched at the UN STI Forum, aims to correct the training data imbalance by encouraging investment in infrastructure, local innovation, and culturally diverse datasets. One example is the partnership between Microsoft and the Thai government to upskill 1 million Thais in AI by 2026 as part of the THAI Academy program.
These initiatives recognise that reskilling and digital literacy programs – particularly in regions historically excluded from AI development – are essential for closing the digital divide. The focus isn’t just on consuming AI technology, but on building local capacity to create culturally relevant models that serve their populations’ specific needs and contexts.
The regulatory environment is pushing in this direction anyway. New York City now mandates AI bias audits for hiring algorithms. The EU’s AI Act, adopted in 2024, establishes the world’s first comprehensive legal framework for artificial intelligence, with a strong focus on fairness, safety, and the protection of fundamental rights. The Act introduces binding obligations to ensure that AI systems used in the EU are transparent, non-discriminatory, and subject to human oversight. Companies adopting market research methodologies now will position themselves ahead of expanding compliance requirements.
The choice is obvious
Market research spent decades learning how to capture authentic human diversity in data. The methodologies exist. The business case is clear. The question isn’t whether inclusive AI development works – market research proves it does.
It is now up to AI developers to embrace these proven practices – or to continue building systems that exclude much of the world’s population.
As one research expert put it: “Inclusive research practices ensure that insights represent the diverse characteristics and experiences of the entire population. Without inclusive samples, the most vulnerable and under-represented – those who stand to gain the most from new solutions – will be overlooked.”
The same warning applies to AI. We can build technology that serves everyone, or we can keep building it for the privileged few. Market research shows us how to choose wisely.
Article featured in She Is AI Magazine – July 2025.
By Nichola Quail, Founder and CEO, Insights Exchange