Date: 2023-12-04

BENGALURU - For a few weeks this year, villagers in the Indian state of Karnataka read out dozens of sentences in their native Kannada language into an app as part of a project to build the country’s first AI-based chatbot for tuberculosis.

Kannada has more than 40 million native speakers in India. It is one of the country’s 22 official languages and one of more than 121 languages spoken by 10,000 people or more in the world’s most populous nation.

But few of these languages are covered by natural language processing, the branch of artificial intelligence that enables computers to understand text and spoken words.

Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities.

“For AI tools to work for everyone, they need to also cater to people who don’t speak English or French or Spanish,” said Ms Kalika Bali, principal researcher at Microsoft Research India.

“But if we had to collect as much data in Indian languages as went into a large language model like GPT, we’d be waiting another 10 years. So what we can do is create layers on top of generative AI models such as ChatGPT or Llama,” Ms Bali told the Thomson Reuters Foundation.

The villagers in Karnataka are among thousands of speakers of different Indian languages generating speech data for tech firm Karya, which is building datasets for firms such as Microsoft and Google to use in AI models for education, healthcare and other services.

The Indian government, which aims to deliver more services digitally, is also building language datasets through Bhashini, an AI-led language translation system that is creating open-source datasets in local languages for use in AI tools.

The platform includes a crowdsourcing initiative for people to contribute sentences in various languages, validate audio or text transcribed by others, translate texts and label images. Tens of thousands of Indians have contributed to Bhashini.

“The government is pushing very strongly to create datasets to train large language models in Indian languages, and these are already in use in translation tools for education, tourism and in the courts,” said Mr Pushpak Bhattacharyya, head of the Computation for Indian Language Technology Lab in Mumbai.

“But there are many challenges: Indian languages mainly have an oral tradition, electronic records are not plentiful, and there is a lot of code mixing. Also, to collect data in less common languages is hard, and requires a special effort.”

Economic Value

Of the more than 7,000 living languages in the world, fewer than 100 are captured in major databases, with English the most advanced.

The launch of ChatGPT last year triggered a wave of interest in generative AI. ChatGPT is trained primarily on English. Google’s Bard is limited to English, and of the nine languages that Amazon’s Alexa can respond to, only three are non-European: Arabic, Hindi and Japanese.