Science and technology | Polyglot machines

Why AI needs to learn new languages

Efforts are under way to make AI fluent in more than just English

A digital globe with characters from a number of different languages in the background
Illustration: Nick Kempton
Abu Dhabi and Chennai

ChatGPT, a chatbot developed by OpenAI, an American firm, can give passable answers to questions on everything from nuclear engineering to Stoic philosophy. Or at least, it can in English. The latest version, GPT-4, scored 85% on a common question-and-answer test. In other languages it is less impressive. When taking the test in Telugu, an Indian language spoken by nearly 100m people, for instance, it scored just 62%.

OpenAI has not revealed much about how GPT-4 was built. But a look at its predecessor, GPT-3, is suggestive. Large language models (LLMs) are trained on text scraped from the internet, on which English is the lingua franca. Around 93% of GPT-3’s training data was in English. In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more. Chinese and Japanese combined, by contrast, make up just 9%. Telugu was not even a rounding error.

This article appeared in the Science & technology section of the print edition under the headline "Sending AI to language school"

From the January 27th 2024 edition
