AI's Matthew effect

As artificial intelligence expands access, progress can appear inclusive. But for speakers of low-resource languages, the benefits remain uneven.

By Grahesh Srinivas
Image courtesy of the Indian Prime Minister's Office.

In a small cafe on a narrow street in Bengaluru, a young artist named Gauri attempts to use a popular artificial intelligence (AI) image generator to design a poster for an upcoming festival. She spends well over five minutes typing a prompt in Kannada, her mother tongue, describing a scene from the festival in vivid detail. The AI returns a confused jumble: a generic “Indian” tableau with mismatched colours and odd symbols that have nothing to do with the festival. She tries again, this time in English, a language in which she is far less comfortable, and within seconds a nuanced, photorealistic image appears.

Gauri’s experience is not a glitch but a symptom of a deepening global divide. In the age of artificial intelligence, linguistic inequality is hardening. Those who know English, the digitally “rich,” reap ever greater benefits from AI, whilst those who do not are increasingly left behind.

Beyond the user experience, the imbalance is also evident in the development process itself. In research conducted by my team and me, integrating Kannada, a language widely spoken in southern India, into an AI model required approximately two months of work and 16 GB of curated training data. The process was further constrained by limited infrastructure and gaps in available linguistic data, which made development slower and less reliable. This is not an isolated case but part of a broader pattern in how low-resource languages are treated in AI development, one that reveals the structural barriers continuing to limit their inclusion.

The Matthew effect, a concept coined by sociologists Robert K. Merton and Harriet Zuckerman in 1968, describes how advantage accumulates over time: those who begin with more resources in society continue to gain, while those with less fall further behind. The same dynamic is now playing out in AI: to the linguistically privileged, more will be given; for those who are not, even basic access to such tools will begin to slip away.

The data advantage behind linguistic inequality

This phenomenon is starkly visible in India, one of the world’s most linguistically diverse nations, home to twenty-two official languages and hundreds of dialects. For urban, English-speaking users, AI tools excel; with regional languages like Tamil, Telugu, or Kannada, they stutter and fail. This is because many of these tools, from voice assistants to content generators, are built on training data dominated by English, which, according to W3Techs, makes up over half of the content on the internet.

The technical heart of the problem lies in the data. AI models are like children: they learn from what they are given. As researcher Britta Schneider argues in her 2022 work “Multilingualism and AI: The Regimentation of Language in the Age of Digital Capitalism”, these models are largely trained on massive, machine-readable corpora dominated by English and Mandarin. This creates a reinforcing cycle: better-resourced languages produce more digital content, which trains better AI, which in turn encourages even more production in those same languages. Low-resource languages, lacking extensive digital text and speech datasets, are left in a perpetual cold start. The results are measurable. Research on speech-recognition bias by Koenecke et al. (2020) found that systems perform significantly worse for non-native and non-standard English accents, a pattern that extends to contexts such as Dravidian-accented speech in southern India and systematically penalises rural and “non-elite” users.

Beyond inconvenience: culture and opportunities at stake

The consequences go beyond mere inconvenience: they reinforce social and economic inequality. Consider education. Indian platforms like the National Programme on Technology Enhanced Learning (NPTEL), the Study Webs of Active-Learning for Young Aspiring Minds (SWAYAM), and others could use AI-powered personalisation to tailor lessons for individual learners, offering explanations in students’ preferred language, adapting content difficulty, and providing real-time feedback. If these tools work reliably only in English or Hindi, they will accelerate the advantage enjoyed by privileged, often urban, students, while leaving the roughly half of learners outside those two dominant languages without equally effective tools.

In media and culture, AI risks homogenisation. Translation models collapse rich, regional synonyms for a word like “water” into a single standardised term. Inaccurate or oversimplified translations flatten culture and erase dialectal identity. Over time, that erosion pushes certain words and phrases out of everyday use, and the cultural and contextual nuances embedded in each variation are lost — a broader limitation highlighted in multilingual AI research by Google. This is particularly concerning for already vulnerable languages. Tulu, a regional language spoken by around two million people in southern India, illustrates what is at risk: despite its rich cultural and oral traditions, it has a limited digital presence, making it susceptible to being misrepresented in AI systems.

What comes next in policy

In India, the issue has become unavoidably political. Recognising this, Prime Minister Modi’s government has launched initiatives such as Digital India Bhashini to build AI capabilities for all of India’s languages.

It is a state-level acknowledgement that, left to market forces alone, AI will likely reinforce a linguistic hierarchy that does not reflect population distribution. According to the 2011 Census of India, Hindi is spoken by 43 per cent of the population, English by roughly 10 per cent, and the rest of the population forms a long tail of other regional languages. This is a crucial political intervention to steer technological progress toward national inclusion rather than deepen the divide.

India’s challenge is also a global one. The U.S., as the epicentre of AI development, exports models and platforms built primarily for Anglophone audiences. U.S. corporate priorities, profit, and efficiency often align poorly with the painstaking, community-focused work needed to uplift low-resource languages. There is no major U.S. policy equivalent to Bhashini that pushes for equitable multilingual AI. Without global pressure, the digital language divide widens, and a burden of this scale is treated as a “local” problem for developing nations like India to solve.

The choice is clear: either we build AI that acts as a bridge between languages, or we accept it as a filter that amplifies dominant tongues. To avoid a smarter, but culturally poorer, future, we must consciously design digital infrastructure for inclusion and, through that, mitigate the Matthew effect in AI.
