ChatGPT Is Cutting Non-English Languages Out of the AI Revolution

Computer scientist Pascale Fung can imagine a rosy future in which polyglot AI helpers like ChatGPT bridge language barriers. In that world, Indonesian store owners fluent only in local dialects might reach new shoppers by listing their products online in English. “It can open opportunities,” Fung says—then pauses. She’s spotted the bias in her vision of a more interconnected future: The AI-aided shopping would be one-sided, because few Americans would bother to use AI translation to help research products advertised in Indonesian. “Americans are not incentivized to learn another language,” she says.

Not every American fits that description—about one in five speak another language at home—but the dominance of English in global commerce is real. Fung, director of the Center for AI Research at the Hong Kong University of Science and Technology, who herself speaks seven languages, sees this bias in her own field. “If you don’t publish papers in English, you’re not relevant,” she says. “Non-English speakers tend to be punished professionally.”

Fung would like to see AI change that, not further reinforce the primacy of English. She’s part of a global community of AI researchers testing the language skills of ChatGPT and its rival chatbots and sounding the alarm about evidence that they are significantly less capable in languages other than English.

Although researchers have identified some potential fixes, the mostly English-spewing chatbots continue to spread. “One of my biggest concerns is we’re going to exacerbate the bias for English and English speakers,” says Thien Huu Nguyen, a University of Oregon computer scientist who’s also been on the case against skewed chatbots. “People are going to follow the norm and not think about their own identities or culture. It kills diversity. It kills innovation.”

At least 15 research papers posted this year on a preprint server, including studies co-authored by Nguyen and Fung, have probed the multilingualism of large language models, the breed of AI software powering experiences such as ChatGPT. The methodologies vary, but their findings fall in line: The AI systems are good at translating other languages into English, but they struggle with translating English into other languages, especially those, like Korean, with non-Latin scripts.

AI scholars are not the only ones worried. At a US congressional hearing this month, Senator Alex Padilla of California asked Sam Altman, CEO of ChatGPT’s creator, OpenAI, which is based in the state, what his company is doing to close the language gap. About 44 percent of Californians speak a language other than English. Altman said he hoped to partner with governments and other organizations to acquire data sets that would bolster ChatGPT’s language skills and broaden its benefits to “as wide of a group as possible.”

Padilla, who also speaks Spanish, is skeptical about the systems delivering equitable linguistic outcomes without big shifts in strategies by their developers. “These new technologies hold great promise for access to information, education, and enhanced communication, and we must ensure that language doesn’t become a barrier to these benefits,” he says.

OpenAI hasn’t hidden the fact that its systems are biased. The company’s report card on GPT-4, its most advanced language model, which is available to paying users of ChatGPT, states that the majority of the underlying data came from English and that the company’s efforts to fine-tune and study the performance of the model primarily focused on English “with a US-centric point of view.” Or as a staff member wrote last December on the company’s support forum, after a user asked if OpenAI would add Spanish support to ChatGPT, “Any good Spanish results are a bonus.” OpenAI declined to comment for this story.

Jessica Forde, a computer science doctoral student at Brown University, has criticized OpenAI for not thoroughly evaluating GPT-4’s capabilities in other languages before releasing it. She’s among the researchers who would like companies to publicly explain their training data and track their progress on multilingual support. “English has been so cemented because people have been saying (and studying), can this perform like a lawyer in English or a doctor in English? Can this produce a comedy in English? But they aren’t asking the same about other languages,” she says.

Large language models work with words using statistical patterns learned from billions of words of text grabbed from the internet, books, and other resources. More of those available materials are in English and Chinese than in other languages, due to US economic dominance and China’s huge population. 

Because text data sets also have some other languages mixed in, the models do pick up capability in other languages. Their knowledge just isn’t necessarily comprehensive. As researchers at the Center for Democracy and Technology in Washington, DC, explained in a paper this month, because of the dominance of English, “a multilingual model might associate the word dove in all languages with peace even though the Basque word for dove (‘uso’) can be an insult.”

Aleyda Solis encountered that weakness when she tried Microsoft’s Bing chat, a search tool that relies on GPT-4. The Bing bot provided her the appropriate colloquial term for sneakers in several English-speaking countries (“trainers” in the UK, “joggers” in parts of Australia) but failed to provide regionally appropriate terms when asked in Spanish for the local footwear lingo across Latin America (“Zapatillas deportivas” for Spain, “championes” for Uruguay). 

In a separate dialog, when queried in English, Bing chat correctly identified Thailand as the rumored location for the next setting of the TV show White Lotus, but provided “somewhere in Asia” when the query was translated to Spanish, says Solis, who runs a consultancy called Orainti that helps websites increase visits from search engines.

Executives at Microsoft, OpenAI, and Google working on chatbots have said users can counteract poor responses by adding more detailed instructions to their queries. Without explicit guidance, chatbots’ bias to fall back on English speech and English-speaking perspectives can be strong. Just ask Veruska Anconitano, another search engine optimization expert, who splits her time between Italy and Ireland. She found that asking Bing chat questions in Italian drew answers in English unless she specified “Answer me in Italian.” In a different chat, Anconitano says, Bing assumed she wanted the Japanese prompt 元気ですか (“How are you?”) rendered into English rather than continuing the conversation in Japanese.

Recent research papers have validated the anecdotal findings of people running into the limits of Bing chat and its brethren. Zheng-Xin Yong, a doctoral student at Brown University who is also studying multilingual language models, says he and his collaborators found in one study that getting better answers to Chinese questions required asking them in English rather than in Chinese.

When Fung at Hong Kong and her collaborators tried asking ChatGPT to translate 30 sentences, it correctly rendered 28 from Indonesian into English, but only 19 in the other direction, suggesting that monoglot Americans who turn to the bot to make deals with Indonesian merchants would struggle. The same limited, one-way fluency was found to repeat across at least five other languages.

Large language models’ language problems make them difficult to trust for anyone venturing past English, and maybe Chinese. When I sought to translate ancient Sanskrit hymns through ChatGPT as part of an experiment in using AI to accelerate wedding planning, the results seemed plausible enough to add into a ceremony script. But I had no idea whether I could rely on them or would be laughed off the stage by elders.

Researchers who spoke to WIRED do see some signs of improvement. When Google created its PaLM 2 language model, released this month, it made an effort to increase the non-English training data for over 100 languages. The model recognizes idioms in German and Swahili and jokes in Japanese, cleans up grammar in Indonesian, and recognizes regional variations better than prior models, Google says.

But in consumer services, Google is keeping PaLM 2 caged. Its chatbot Bard is powered by PaLM 2 but only works in US English, Japanese, and Korean. A writing assistant for Gmail that uses PaLM 2 only supports English. It takes time to officially support a language by conducting testing and applying filters to ensure the system isn’t generating toxic content. Google did not make an all-out investment to launch many languages from the beginning, though it’s working to rapidly add more.

As well as calling out the failings of language models, researchers are creating new data sets of non-English text to try to accelerate the development of truly multilingual models. Fung’s group is curating Indonesian-language data for training models, while Yong’s multi-university team is doing the same for Southeast Asian languages. They’re following the path of groups targeting African languages and Latin American dialects.

“We want to think about our relationship with Big Tech as collaborative rather than adversarial,” says Skyler Wang, a sociologist of technology and AI at UC Berkeley who is collaborating with Yong. “There are a lot of resources that can be shared.”

But collecting more data is unlikely to be enough, because the reams of English text are so large—and still growing. Though it carries the risk of eliminating cultural nuances, some researchers believe companies will have to generate synthetic data—for example, by using intermediary languages such as Mandarin or English to bridge translations between languages with limited training materials. “If we start from scratch, we will never have enough data in other languages,” says Nguyen at the University of Oregon. “If you want to ask about a scientific issue, you do it in English. Same thing in finance.”

Nguyen would also like to see AI developers be more attentive to which data sets they feed into their models and how those choices affect each step in the building process, not just the ultimate responses. So far, which languages have ended up in models has been a “random process,” Nguyen says. More rigorous controls to reach certain thresholds of content for each language, as Google tried to do with PaLM, could boost the quality of non-English output.

Fung has given up on using ChatGPT and other tools born out of large language models for any purpose beyond research. Their speech too often comes off as boring to her. Due to the underlying technology’s design, the chatbots’ utterances are “the average of what’s on the internet,” she says—a calculation that works best in English, and leaves responses in other tongues lacking spice.

Source: Wired